Whisper API vs local Whisper model: latency and accuracy comparison

Yuki Tanaka, Sep 5, 2024

I've benchmarked the Whisper API against running whisper-large-v3 locally for our podcast transcription service. Here are the results.

Test setup

  • 100 podcast episodes (30-90 min each)
  • English language
  • Various audio qualities (studio to phone recordings)

Results

| Metric | API (whisper-1) | Local (large-v3) |
|--------|-----------------|------------------|
| WER (studio) | 4.2% | 3.8% |
| WER (noisy) | 8.7% | 7.1% |
| Latency (1 hr audio) | 45 s | 12 min (RTX 4090) |
| Cost (1 hr audio) | $0.36 | ~$0.02 (electricity) |
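For anyone wanting to reproduce the WER rows, here is a minimal sketch of how word error rate is usually computed: word-level edit distance divided by the number of reference words. The normalization (lowercasing and splitting on whitespace) and the function name `word_error_rate` are assumptions for illustration; the exact text normalization behind the numbers in the table may differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# One dropped word out of four reference words.
print(word_error_rate("the quick brown fox", "the quick fox"))  # 0.25
```

Punctuation and number formatting can move WER by a point or more, so the same normalization should be applied to both systems before comparing them.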

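The API is about 16x faster (45 s vs 12 min per hour of audio), but the local model is more accurate, especially on noisy audio (7.1% vs 8.7% WER). At scale (1000+ hours/month), local is much cheaper: roughly $20 in electricity versus $360 in API fees per 1000 hours, not counting the cost of the GPU itself.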
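The cost comparison at scale is simple arithmetic on the table values. The sketch below only uses the per-hour figures from the table; the function name `monthly_estimate` is made up for illustration, and hardware purchase and maintenance time are deliberately left out.

```python
# Per-hour-of-audio figures taken from the results table.
API_COST_PER_AUDIO_HOUR = 0.36         # USD
LOCAL_COST_PER_AUDIO_HOUR = 0.02       # USD, electricity only
LOCAL_MINUTES_PER_AUDIO_HOUR = 12      # processing time on one RTX 4090


def monthly_estimate(audio_hours: float) -> dict:
    """Estimate monthly cost and local GPU time for a given audio volume."""
    return {
        "api_cost_usd": round(audio_hours * API_COST_PER_AUDIO_HOUR, 2),
        "local_cost_usd": round(audio_hours * LOCAL_COST_PER_AUDIO_HOUR, 2),
        "local_gpu_hours": round(
            audio_hours * LOCAL_MINUTES_PER_AUDIO_HOUR / 60, 1
        ),
    }


print(monthly_estimate(1000))
# {'api_cost_usd': 360.0, 'local_cost_usd': 20.0, 'local_gpu_hours': 200.0}
```

At 1000 hours/month the local route needs about 200 GPU hours, which fits on a single card since a 30-day month has 720 hours.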

We ended up using the API for real-time transcription and local for batch processing.
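A rough sketch of what that split could look like in code. This is not our production setup: `choose_backend`, `transcribe_api`, `transcribe_local`, and the `realtime` flag are hypothetical names. It assumes the `openai` Python package (v1 client) with `OPENAI_API_KEY` set, and the `openai-whisper` package for the local path.

```python
from functools import lru_cache
from pathlib import Path


def choose_backend(realtime: bool) -> str:
    """Latency-sensitive jobs go to the API, everything else runs locally."""
    return "api" if realtime else "local"


def transcribe_api(path: Path) -> str:
    # Imported lazily so the local path works without the openai package.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    with path.open("rb") as audio:
        result = client.audio.transcriptions.create(model="whisper-1", file=audio)
    return result.text


@lru_cache(maxsize=1)
def _local_model():
    # Imported lazily; loading large-v3 is slow, so the model is cached.
    import whisper

    return whisper.load_model("large-v3")


def transcribe_local(path: Path) -> str:
    return _local_model().transcribe(str(path))["text"]


def transcribe(path: Path, realtime: bool) -> str:
    """Dispatch one audio file to the backend chosen by choose_backend."""
    if choose_backend(realtime) == "api":
        return transcribe_api(path)
    return transcribe_local(path)
```

The API has a per-file upload size limit (25 MB as far as I know), so full-length episodes would need to be compressed or split into chunks before going through `transcribe_api`.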
