Speaker diarization with Whisper: Workarounds and tips
Yuki Tanaka · Feb 18, 2025
Whisper doesn't natively support speaker diarization, but here's my pipeline that combines Whisper with pyannote for speaker identification:
Step 1: Diarization with pyannote

```python
from pyannote.audio import Pipeline

pipeline = Pipeline.from_pretrained("pyannote/speaker-diarization-3.1")
diarization = pipeline("meeting.wav")
```

Step 2: Transcription with Whisper

```python
from openai import OpenAI

client = OpenAI()
transcript = client.audio.transcriptions.create(
    model="whisper-1",
    file=open("meeting.wav", "rb"),
    response_format="verbose_json",
    timestamp_granularities=["segment"],
)
```

Step 3: Align diarization with transcription segments

```python
for segment in transcript.segments:
    speaker = get_speaker_at_time(diarization, segment.start)
    print(f"[{speaker}] {segment.text}")
```
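I didn't show `get_speaker_at_time` above — here's a minimal sketch. It assumes pyannote's `Annotation.itertracks(yield_label=True)` API for walking speaker turns; the helper name and the `"UNKNOWN"` fallback are my own conventions, not anything from pyannote:

```python
def get_speaker_at_time(diarization, t):
    # Walk the diarization turns and return the label of the first
    # turn whose [start, end] interval covers time t.
    # Falls back to "UNKNOWN" when t lands in a gap (silence or
    # a region the diarizer didn't assign to anyone).
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        if turn.start <= t <= turn.end:
            return speaker
    return "UNKNOWN"
```

Note this picks the first matching turn, so if two speakers overlap at time `t`, whichever turn is iterated first wins.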
In my testing, speaker attribution is roughly 90% accurate with 2-3 speakers but degrades as the speaker count grows. Would love to see native diarization in the Whisper API!
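One thing that helps with the degradation: looking up the speaker only at `segment.start` is fragile, because Whisper's segment boundaries often land a little early or late. A more robust variant (a sketch — `get_dominant_speaker` is my own helper, again assuming only `itertracks`) scores each speaker by total overlap with the whole segment and picks the dominant one:

```python
def get_dominant_speaker(diarization, start, end):
    # Accumulate, per speaker, how many seconds of their turns
    # overlap the [start, end] transcript segment.
    totals = {}
    for turn, _, speaker in diarization.itertracks(yield_label=True):
        overlap = min(turn.end, end) - max(turn.start, start)
        if overlap > 0:
            totals[speaker] = totals.get(speaker, 0.0) + overlap
    # Return whoever speaks longest within the segment.
    return max(totals, key=totals.get) if totals else "UNKNOWN"
```

In the alignment loop you'd call `get_dominant_speaker(diarization, segment.start, segment.end)` instead of the point lookup.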