How to detect speakers and understand who said what?
Speaker diarization is the process of automatically detecting multiple speakers in an audio recording and attributing each spoken segment to the correct speaker. This makes it easy to understand who said what in conversations, interviews, meetings, and more.
Diarization is currently only available with Asynchronous Transcription.
Learn more about diarization in this article.
Why Use Diarization?
Clarity in conversations – Separate speakers in multi-person dialogues.
Accurate transcripts – Assign utterances to the right person.
Search & analytics – Easily navigate and analyze speaker-specific parts of recordings.
Enabling Diarization
To enable speaker diarization, include the diarization parameter in your transcription request:

```json
{
  "audio_url": "<your audio URL>",
  "diarization": true
}
```
This will process the audio and annotate each utterance with a speaker index.
Enhanced Diarization
For more robust diarization, especially with challenging audio (e.g., overlapping voices, background noise, varied accents), you can enable the enhanced mode via the diarization_config object:
```json
{
  "audio_url": "<your audio URL>",
  "diarization": true,
  "diarization_config": {
    "enhanced": true
  }
}
```
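As a sketch, both request bodies above can be assembled with a small helper. The field names (audio_url, diarization, diarization_config) come from the examples; the helper function itself is hypothetical:

```python
# Hypothetical helper for building the two request bodies shown above.
def build_request(audio_url: str, enhanced: bool = False) -> dict:
    body = {"audio_url": audio_url, "diarization": True}
    if enhanced:
        # Enhanced mode for challenging audio (overlap, noise, varied accents).
        body["diarization_config"] = {"enhanced": True}
    return body

basic = build_request("https://example.com/call.mp3")
robust = build_request("https://example.com/call.mp3", enhanced=True)
```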
Response Format
When diarization is enabled, each utterance in the response includes a speaker field.
Speakers are indexed in order of appearance.
Speaker 0 = first person detected
Speaker 1 = second person detected
And so on...
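Because speakers are plain indices, a common follow-up step is to map them to display names. A minimal sketch, assuming utterances shaped like the response format described here (the names themselves are placeholders):

```python
# Sketch: replace speaker indices (0, 1, ...) with display names.
# Utterance shape follows the diarized response format; names are assumptions.
def label_speakers(utterances: list[dict], names: dict[int, str]) -> list[dict]:
    labeled = []
    for u in utterances:
        # Fall back to "Speaker N" when no name is provided for an index.
        name = names.get(u["speaker"], f"Speaker {u['speaker']}")
        labeled.append({**u, "speaker_name": name})
    return labeled

utterances = [
    {"text": "it says you are trained in technology.", "speaker": 0},
    {"text": "yes, that's correct.", "speaker": 1},
]
result = label_speakers(utterances, {0: "Interviewer", 1: "Candidate"})
```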
Example Response
```json
{
  "transcription": {
    "utterances": [
      {
        "words": [...],
        "text": "it says you are trained in technology.",
        "language": "en",
        "start": 0.7334,
        "end": 2.364,
        "confidence": 0.8914,
        "channel": 0,
        "speaker": 0
      },
      {
        "words": [...],
        "text": "yes, that's correct.",
        "language": "en",
        "start": 2.500,
        "end": 3.210,
        "confidence": 0.912,
        "channel": 0,
        "speaker": 1
      }
    ]
  }
}
```
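To get a readable "who said what" transcript, you can walk the utterances and print one line per speaker turn. A minimal sketch using a response shaped like the example (word lists omitted):

```python
# Sketch: render a diarized response as a timestamped transcript.
# The response structure mirrors the example response above.
response = {
    "transcription": {
        "utterances": [
            {"text": "it says you are trained in technology.",
             "start": 0.7334, "end": 2.364, "speaker": 0},
            {"text": "yes, that's correct.",
             "start": 2.500, "end": 3.210, "speaker": 1},
        ]
    }
}

lines = []
for u in response["transcription"]["utterances"]:
    # One line per utterance: start time, speaker index, spoken text.
    lines.append(f"[{u['start']:.2f}s] Speaker {u['speaker']}: {u['text']}")

transcript = "\n".join(lines)
# First line: "[0.73s] Speaker 0: it says you are trained in technology."
```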