Speaker diarization with Whisper: Workarounds and tips

Yuki Tanaka
Feb 18, 2025

Whisper doesn't natively support speaker diarization, but here's my pipeline that combines Whisper with pyannote for speaker identification:

```python
from pyannote.audio import Pipeline
from openai import OpenAI
```

Step 1: Diarization with pyannote

```python
# Note: this model is gated on Hugging Face; you need a token
# with the pyannote model terms accepted.
pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN",
)
diarization = pipeline("meeting.wav")
```

Step 2: Transcription with Whisper

```python
client = OpenAI()
with open("meeting.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
        response_format="verbose_json",
        timestamp_granularities=["segment"],
    )
```

Step 3: Align diarization with transcription segments

```python
for segment in transcript.segments:
    speaker = get_speaker_at_time(diarization, segment.start)
    print(f"[{speaker}] {segment.text}")
```
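`get_speaker_at_time` isn't a pyannote built-in; it's a small helper you have to write yourself. A minimal sketch, assuming the diarization result is a `pyannote.core.Annotation` (whose `itertracks(yield_label=True)` yields `(segment, track, label)` tuples with `segment.start` / `segment.end`):

```python
def get_speaker_at_time(diarization, t):
    """Return the speaker label whose turn contains time t (seconds),
    or "UNKNOWN" if no diarization turn covers that instant."""
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            return speaker
    return "UNKNOWN"
```

A linear scan is fine for a single meeting; if turns overlap (crosstalk), this returns the first matching turn, so you may want to pick the turn whose midpoint is closest to `t` instead.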

In my tests, attribution accuracy is around 90% for 2-3 speakers and degrades as the speaker count grows. Would love to see native diarization in the Whisper API!
