Posted about 4 hours ago · 4 min read

Benchmarking speaker diarization in AI note-takers — not just transcription accuracy

When people benchmark an AI meeting note-taker, they measure transcription accuracy — word error rate against a reference transcript. That's the easy half. The hard half, and the one almost nobody scores, is diarization: figuring out who spoke each word. A transcript that's 99% accurate on the words but attributes them to the wrong person is worse than useless in a meeting, because the entire value of meeting notes is "who committed to what."

I ran a controlled test across three note-takers and found the word accuracy was basically a tie — but diarization split them completely. Here's the ML behind why, and how to actually measure it.

Transcription and diarization are different problems

Transcription (ASR) maps audio → text. Diarization maps audio → a timeline of speaker-homogeneous segments: "speaker A from 0.0–4.2s, speaker B from 4.2–9.1s," and so on. The two are orthogonal. You can nail one and fail the other.

The standard way to do diarization from a single audio stream is a pipeline:

Voice activity detection (VAD) — find the segments that contain speech.
Speaker embeddings — turn each segment into a fixed-length vector (x-vectors, d-vectors, ECAPA-TDNN) that captures voice identity, not content.
Clustering — group the embeddings (agglomerative or spectral clustering) so each cluster ≈ one speaker.
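
The clustering step can be sketched in a few lines. This is a toy illustration rather than any tool's actual implementation: the three-dimensional "embeddings" are made up (real x-vectors or ECAPA-TDNN embeddings have hundreds of dimensions), and the 0.3 cosine-distance threshold is an arbitrary assumption. Stopping on a distance threshold instead of a target cluster count is one common way to cope with not knowing how many speakers there are.

```python
import math
from itertools import combinations

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def cluster_speakers(embeddings, threshold=0.3):
    """Average-linkage agglomerative clustering over segment embeddings.

    Merging stops once the two closest clusters are farther apart than
    `threshold` (an assumed value), so the speaker count is not needed.
    Returns one integer cluster label per input segment.
    """
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        dists = [cosine_distance(embeddings[i], embeddings[j])
                 for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        pairs = combinations(range(len(clusters)), 2)
        a, b = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        if linkage(clusters[a], clusters[b]) > threshold:
            break
        clusters[a] += clusters[b]  # a < b, so deleting b keeps a valid
        del clusters[b]

    labels = [0] * len(embeddings)
    for label, members in enumerate(clusters):
        for i in members:
            labels[i] = label
    return labels

# Four hypothetical speech segments: 0 and 2 sound alike, 1 and 3 sound alike.
toy_embeddings = [(0.9, 0.1, 0.0), (0.1, 0.9, 0.1), (0.8, 0.2, 0.1), (0.0, 1.0, 0.2)]
labels = cluster_speakers(toy_embeddings)
print(labels)  # [0, 1, 0, 1]
```

The failure modes are visible in the sketch: raise the threshold and two similar voices merge into one cluster; feed it very short segments with noisy embeddings and one speaker splits into several.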

It's hard for reasons that don't affect plain transcription: you usually don't know the number of speakers in advance, short turns give the embedder little to work with, similar voices collapse into one cluster, and overlapping speech (two people at once) breaks the "one segment, one speaker" assumption outright.

The metric is Diarization Error Rate (DER):

DER = (false_alarm + missed_speech + speaker_confusion) / total_speech_time

— time scored as speech that wasn't, speech that was missed, and (the interesting term) time attributed to the wrong speaker.
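
As a rough illustration of how those three terms are counted, here is a frame-based sketch. It makes several simplifying assumptions that real scorers do not: segments never overlap, time is cut into fixed 100 ms frames, and there is no forgiveness collar around boundaries. Because a tool's labels ("S1", "S2") are arbitrary, it tries every one-to-one mapping onto the reference speakers and keeps the best, which is only practical for a handful of speakers. The segment times reuse the 0.0–4.2 s / 4.2–9.1 s example from earlier; the hypothesis is invented.

```python
from itertools import permutations

def frame_labels(segments, n_frames, step_ms):
    """One label per frame (None = silence). Assumes non-overlapping segments."""
    labels = [None] * n_frames
    for start_ms, end_ms, speaker in segments:
        for f in range(start_ms // step_ms, min(end_ms // step_ms, n_frames)):
            labels[f] = speaker
    return labels

def der(reference, hypothesis, step_ms=100):
    """Frame-based DER for (start_ms, end_ms, speaker) segment lists."""
    n_frames = max(seg[1] for seg in reference + hypothesis) // step_ms
    ref = frame_labels(reference, n_frames, step_ms)
    hyp = frame_labels(hypothesis, n_frames, step_ms)

    total_speech = sum(r is not None for r in ref)
    false_alarm = sum(r is None and h is not None for r, h in zip(ref, hyp))
    missed = sum(r is not None and h is None for r, h in zip(ref, hyp))

    # Frames where both sides say "speech": find the label mapping that
    # agrees on the most frames; the rest is speaker confusion.
    both = [(r, h) for r, h in zip(ref, hyp) if r is not None and h is not None]
    ref_speakers = sorted({r for r, _ in both} | {r for r in ref if r is not None})
    hyp_speakers = sorted({h for h in hyp if h is not None})
    padded = ref_speakers + [None] * max(0, len(hyp_speakers) - len(ref_speakers))
    best_match = 0
    for perm in permutations(padded, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, perm))
        best_match = max(best_match, sum(mapping[h] == r for r, h in both))
    confusion = len(both) - best_match

    return (false_alarm + missed + confusion) / total_speech

reference = [(0, 4200, "A"), (4200, 9100, "B")]
hypothesis = [(0, 5000, "S1"), (5000, 9100, "S2")]  # turn boundary placed 0.8 s late
score = der(reference, hypothesis)
print(round(score, 3))  # 8 confused frames out of 91, about 0.088
```

Established scoring tools handle overlap and boundary collars properly, so treat this as a way to build intuition for the formula, not as a replacement for them.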

You can't score diarization without ground truth — so generate it

The same trick that makes synthetic audio good for transcription benchmarks makes it great for diarization: when you build the clip, you already know who said every line. My test clip is generated turn-by-turn, each utterance bound to a specific synthetic voice, so the script is the reference diarization:

SARAH=21m00Tcm4TlvDq8ikWAM   # Rachel (female)
DAVID=pNInz6obpgDQGcFmaJgB   # Adam (male)

# gen VOICE_ID OUTFILE TEXT: helper that synthesizes one turn (definition not shown)
gen $SARAH t00.mp3 "Morning, David. Before we start, did the Q3 churn numbers come in?"
gen $DAVID t01.mp3 "They did. We closed at five point two percent monthly churn..."
gen $SARAH t02.mp3 "That's a big drop. Which cohort moved the most?"
# ...alternating turns, each tied to a known voice...

After concatenating the turns (with a short silence between them so the VAD has clean boundaries), I have an 80-second two-speaker meeting and a ground-truth (speaker, text) sequence I typed myself. Now I can grade not just what each tool transcribed, but whether it attributed each turn to the right speaker — the thing DER is built to measure.
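
Turning the script into a reference timeline is mostly bookkeeping. The sketch below assumes you know each generated turn's duration (the values here are invented placeholders, not measurements from the actual clip) and that a fixed silence gap was inserted between turns; `build_reference` and the 0.5 s gap are illustrative names and values, not part of the original setup.

```python
def build_reference(turns, gap_s=0.5):
    """Lay turns end to end with a fixed silence gap between them.

    turns: list of (speaker, text, duration_s) in playback order.
    Returns one dict per turn with speaker, start, end, and text.
    """
    timeline, cursor = [], 0.0
    for speaker, text, duration_s in turns:
        timeline.append({
            "speaker": speaker,
            "start": round(cursor, 3),
            "end": round(cursor + duration_s, 3),
            "text": text,
        })
        cursor += duration_s + gap_s
    return timeline

# Durations are hypothetical; in practice, measure them from each audio file.
turns = [
    ("SARAH", "Morning, David. Before we start, did the Q3 churn numbers come in?", 4.0),
    ("DAVID", "They did. We closed at five point two percent monthly churn...", 5.0),
    ("SARAH", "That's a big drop. Which cohort moved the most?", 3.0),
]
timeline = build_reference(turns)
for seg in timeline:
    print(seg["speaker"], seg["start"], seg["end"])
```

The resulting list is the ground truth for both halves of the benchmark: join the `text` fields for the transcription reference, and read the `speaker`, `start`, and `end` fields for the diarization reference.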

(I didn't compute a formal DER here — on a clean, non-overlapping two-voice clip the result is essentially binary, and that turned out to be the whole story.)
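
A turn-level check like that can still be made explicit. The following is a minimal sketch of one way to score it, not the exact procedure used for this test: it assumes the tool's output has already been aligned to the reference turn by turn, uses `None` for turns that carry no speaker label, and tries every mapping from the tool's arbitrary labels to the real names.

```python
from itertools import permutations

def attribution_accuracy(reference, hypothesis):
    """Fraction of turns attributed to the right speaker.

    reference:  true speaker name per turn.
    hypothesis: the tool's label per turn (any strings, or None if unlabeled).
    Labels are matched to names with the best one-to-one mapping.
    """
    ref_names = sorted(set(reference))
    hyp_names = sorted({h for h in hypothesis if h is not None})
    padded = ref_names + [None] * max(0, len(hyp_names) - len(ref_names))
    best = 0
    for perm in permutations(padded, len(hyp_names)):
        mapping = dict(zip(hyp_names, perm))
        correct = sum(h is not None and mapping[h] == r
                      for r, h in zip(reference, hypothesis))
        best = max(best, correct)
    return best / len(reference)

truth = ["SARAH", "DAVID", "SARAH", "DAVID"]
labeled = ["Speaker 2", "Speaker 1", "Speaker 2", "Speaker 1"]
one_slip = ["Speaker 1", "Speaker 2", "Speaker 1", "Speaker 1"]
unlabeled = [None, None, None, None]

print(attribution_accuracy(truth, labeled))    # 1.0
print(attribution_accuracy(truth, one_slip))   # 0.75
print(attribution_accuracy(truth, unlabeled))  # 0.0
```

On a clean two-voice clip the outcomes cluster at the extremes, which is why the result reads as binary: a tool either separates the speakers or returns no labels at all.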

The result: same words, very different "who"

All three tools transcribed the words well — one clocked ~99% word accuracy, and the differences in text were small. Diarization was not close:

The bot-based tool (Otter) labeled both speakers correctly. It produced a verbatim transcript split cleanly into the two speakers, and every turn was attributed to the right one. (It wasn't flawless on text — it collapsed "Q3"/"Q2" into a bare "Q" — but the who was right.)
The bot-free tool (Granola) returned a single unlabeled stream. On an ad-hoc capture it told me up front it "won't know who is speaking," and the transcript bore that out: no Speaker 1 / Speaker 2, just one merged column of text.

That gap isn't a quality difference, it's an architecture difference, and it maps straight onto the pipeline above. A tool that joins the call as a bot can often get per-participant audio (or at least the platform's speaker events) — which means it barely has to do diarization; the labels come for free from separate channels. A bot-free tool that captures your device's system audio gets one mixed stream, so it's stuck solving the full VAD → embed → cluster problem from scratch, and on a short ad-hoc capture it punts. The input shape decides the outcome before the model ever runs.
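
To see why separate channels make the problem nearly trivial, consider this sketch. It is purely illustrative and assumes a setup I have not verified for any specific product: each participant arrives on their own channel, represented here as a list of made-up per-frame energy values, and a simple threshold decides when that participant is talking. No embeddings or clustering are involved.

```python
def label_from_channels(channels, frame_s=0.5, threshold=0.1):
    """Build a speaker timeline from per-participant energy tracks.

    channels: {participant_name: [energy per frame, ...]} (hypothetical input).
    Returns (participant, start_s, end_s) tuples sorted by start time.
    """
    segments = []
    for name, energies in channels.items():
        start = None
        # A trailing 0.0 sentinel closes any run that lasts to the final frame.
        for i, energy in enumerate(list(energies) + [0.0]):
            active = energy > threshold
            if active and start is None:
                start = i
            elif not active and start is not None:
                segments.append((name, start * frame_s, i * frame_s))
                start = None
    return sorted(segments, key=lambda seg: seg[1])

channels = {
    "Sarah": [0.8, 0.9, 0.0, 0.0, 0.7, 0.0],
    "David": [0.0, 0.0, 0.6, 0.7, 0.0, 0.0],
}
segments = label_from_channels(channels)
print(segments)
# [('Sarah', 0.0, 1.0), ('David', 1.0, 2.0), ('Sarah', 2.0, 2.5)]
```

The speaker name is simply the channel key. With a single mixed stream there is no such key, and the name has to be inferred from the voice itself.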

This is exactly why, in the full hands-on Otter review, clean speaker-labeled diarization is the feature I weigh most heavily in its favor — it's the genuine thing Otter does that the bot-free tools structurally can't on an impromptu capture, and it's load-bearing for anyone who needs a record of who-said-what rather than just a wall of text. (It doesn't save Otter from a stingy free tier and a consent lawsuit that pull its overall score down, but the engine underneath is real.)

Takeaways if you're evaluating speech tools

Score diarization separately from transcription. WER tells you nothing about attribution. If your use case is meetings, interviews, or any multi-speaker audio, 99% word accuracy (a 1% WER) with broken speaker labels is a failing grade.
Generate your reference. Hand-labeling diarization ground truth on real audio is brutal; synthesizing the clip gives you perfect (speaker, segment) labels for free, and lets you plant the hard cases (short turns, similar voices, deliberate overlap) on purpose.
Mind the capture architecture before the model. Whether a tool gets per-speaker streams or one mixed stream determines whether diarization is trivial or the hard clustering problem — often more than the model quality does.
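Watch overlapping speech. Clean alternating turns (like my test) are the friendly case. Real meetings have crosstalk, which is where DER climbs: a pipeline that assigns one speaker per segment has to drop one of the overlapping voices (missed_speech) and can mislabel the one it keeps (speaker_confusion). Benchmark that separately before you trust any tool on a messy call.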
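
The first takeaway is easy to demonstrate. The sketch below computes WER as word-level edit distance divided by reference length, then scores a hypothetical transcript whose words are perfect but whose speakers are swapped. The two-turn example is made up for illustration, and the normalization (lowercasing, whitespace splitting) is deliberately minimal.

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)

def joined_text(turns):
    return " ".join(text for _, text in turns)

reference_turns = [("SARAH", "that's a big drop"), ("DAVID", "they did")]
hypothesis_turns = [("DAVID", "that's a big drop"), ("SARAH", "they did")]

word_error = wer(joined_text(reference_turns), joined_text(hypothesis_turns))
wrong_speakers = sum(r[0] != h[0] for r, h in zip(reference_turns, hypothesis_turns))

print(word_error)      # 0.0
print(wrong_speakers)  # 2
```

A perfect WER and every turn attributed to the wrong person: the two scores have to be reported side by side.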

If you've built a diarization eval that handles overlap well, or a better way to generate overlap-heavy ground truth than splicing synthetic turns, I'd like to read it in the comments.
