Speaker diarization with Whisper: Workarounds and tips
Yuki Tanaka · Feb 18, 2025
Whisper doesn't natively support speaker diarization, but here's my pipeline that combines Whisper with pyannote for speaker identification:
Step 1: Diarization with pyannote

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
diarization = pipeline("meeting.wav")
```

Step 2: Transcription with Whisper

```python
from openai import OpenAI

client = OpenAI()
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=open("meeting.wav", "rb"),
    response_format="verbose_json",
    timestamp_granularities=["segment"],
)
```

Step 3: Align diarization with transcription segments

```python
for segment in transcript.segments:
    speaker = get_speaker_at_time(diarization, segment.start)
    print(f"[{speaker}] {segment.text}")
```
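I didn't show `get_speaker_at_time` above — here's a minimal sketch. It assumes pyannote's `Annotation.itertracks(yield_label=True)` API for walking speaker turns; the helper name and the `"UNKNOWN"` fallback are my own conventions, not anything from pyannote:

```python
def get_speaker_at_time(diarization, t):
    # Walk the diarization turns and return the label of the first
    # turn whose [start, end] interval covers time t.
    # Falls back to "UNKNOWN" when t lands in a gap (silence or
    # a region the diarizer didn't assign to anyone).
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            return speaker
    return "UNKNOWN"
```

Note this picks the first matching turn, so if two speakers overlap at time `t`, whichever turn is iterated first wins.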
In my testing, speaker attribution is roughly 90% accurate with 2-3 speakers but degrades as the speaker count grows. Would love to see native diarization in the Whisper API!
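One thing that helps with the degradation: looking up the speaker only at `segment.start` is fragile, because Whisper's segment boundaries often land a little early or late. A more robust variant (a sketch — `get_dominant_speaker` is my own helper, again assuming only `itertracks`) scores each speaker by total overlap with the whole segment and picks the dominant one:

```python
def get_dominant_speaker(diarization, start, end):
    # Accumulate, per speaker, how many seconds of their turns
    # overlap the [start, end] transcript segment.
    totals = {}
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        overlap = min(turn.end, end) - max(turn.start, start)
        if overlap > 0:
            totals[speaker] = totals.get(speaker, 0.0) + overlap
    # Return whoever speaks longest within the segment.
    return max(totals, key=totals.get) if totals else "UNKNOWN"
```

In the alignment loop you'd call `get_dominant_speaker(diarization, segment.start, segment.end)` instead of the point lookup.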