v3 vs v2: What Actually Improved
The improvements from Whisper Large v2 to Large v3 are measurable on benchmarks such as Common Voice and FLEURS. OpenAI reports a 10-20% relative reduction in WER (word error rate) across most languages, with the largest improvements on lower-resource languages. The model was trained on roughly 5 million hours of audio: about 1 million hours of weakly labeled audio plus 4 million hours pseudo-labeled with Large v2. That is several times the roughly 680,000 hours used for the original Whisper training set.
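Key improvements in v3 and its tooling:
- Word-level timestamps through the inference libraries (openai-whisper and faster-whisper both expose word_timestamps; this is a library feature that also works with v2, not a change to the model weights)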
- Better handling of code-switching (mid-sentence language switching)
- Reduced hallucination on silence/music segments
- Improved number and proper noun accuracy
99 Languages and Accuracy
Whisper Large v3, distributed on the Hugging Face Hub as openai/whisper-large-v3, supports 99 languages with wildly varying accuracy. For English, Spanish, French, German, Japanese, and Mandarin, WER is typically under 5% on clean audio. For lower-resource languages, expect 15-30% WER. Always benchmark on your specific audio domain before production deployment.
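To benchmark on your own audio you need a WER number for each transcript. The following is a minimal sketch of the standard definition (word-level edit distance divided by the number of reference words). The function name word_error_rate and the lowercase-and-split normalization are assumptions made for illustration; real evaluations usually apply heavier text normalization, and a dedicated library such as jiwer may be a better fit.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    # Naive normalization (assumption): lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

Averaging this over a few hundred utterances from your own domain gives a more useful number than any published benchmark figure.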
Python Transcription With faster-whisper
The faster-whisper library runs up to 4x faster than the original openai-whisper implementation at the same accuracy, thanks to its CTranslate2 backend:
from faster_whisper import WhisperModel

# Load Large v3 on the GPU in half precision
model = WhisperModel(
    "large-v3",
    device="cuda",
    compute_type="float16"
)

# Leave `language` unset so the model auto-detects it; passing
# language="en" skips detection and info.language just echoes "en"
segments, info = model.transcribe(
    "audio.mp3",
    beam_size=5,
    word_timestamps=True
)

print(f"Detected language: {info.language} ({info.language_probability:.2f})")

# `segments` is a lazy generator: transcription runs as it is iterated
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
    for word in segment.words:
        print(f"  Word: '{word.word}' at {word.start:.2f}s")
Speaker Diarization With pyannote
Combine Whisper with pyannote.audio for "who said what when" transcripts:
from pyannote.audio import Pipeline
from faster_whisper import WhisperModel

# Requires a Hugging Face token and accepted access terms for the model
diarize_pipeline = Pipeline.from_pretrained(
    "pyannote/speaker-diarization-3.1",
    use_auth_token="YOUR_HF_TOKEN"
)
whisper_model = WhisperModel("large-v3", device="cuda", compute_type="float16")

# Run diarization
diarization = diarize_pipeline("audio.wav")

# Run transcription with word timestamps; list() forces the lazy
# generator to actually transcribe the file
segments, _ = whisper_model.transcribe("audio.wav", word_timestamps=True)
segments = list(segments)

# Print speaker turns; aligning them to words needs extra logic
# (see the pyannote documentation)
for turn, _, speaker in diarization.itertracks(yield_label=True):
    print(f"{speaker}: {turn.start:.1f}s to {turn.end:.1f}s")
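One simple way to align speakers to words is to give each word the speaker whose turn overlaps it the most. The sketch below works on plain tuples so it runs without either library. The names assign_speakers and merge_runs are hypothetical helpers, not part of pyannote or faster-whisper, and maximum overlap is only one of several reasonable alignment heuristics.

```python
from typing import List, Optional, Tuple

Word = Tuple[float, float, str]  # (start_s, end_s, text)
Turn = Tuple[float, float, str]  # (start_s, end_s, speaker_label)


def assign_speakers(words: List[Word], turns: List[Turn]) -> List[Tuple[Optional[str], str]]:
    """Label each word with the speaker whose turn overlaps it most."""
    labelled = []
    for w_start, w_end, text in words:
        best_speaker, best_overlap = None, 0.0
        for t_start, t_end, speaker in turns:
            # Length of the intersection of the two intervals (<= 0 if disjoint)
            overlap = min(w_end, t_end) - max(w_start, t_start)
            if overlap > best_overlap:
                best_speaker, best_overlap = speaker, overlap
        # Words that overlap no turn keep the label None
        labelled.append((best_speaker, text))
    return labelled


def merge_runs(labelled: List[Tuple[Optional[str], str]]) -> List[Tuple[Optional[str], str]]:
    """Join consecutive words from the same speaker into one line."""
    lines: List[Tuple[Optional[str], str]] = []
    for speaker, text in labelled:
        if lines and lines[-1][0] == speaker:
            lines[-1] = (speaker, lines[-1][1] + " " + text.strip())
        else:
            lines.append((speaker, text.strip()))
    return lines


if __name__ == "__main__":
    words = [(0.0, 0.4, " Hello"), (0.5, 0.9, " there"), (1.1, 1.5, " Hi")]
    turns = [(0.0, 1.0, "SPEAKER_00"), (1.0, 2.0, "SPEAKER_01")]
    for speaker, line in merge_runs(assign_speakers(words, turns)):
        print(f"{speaker}: {line}")
```

With the objects from the snippet above, the inputs could be built as [(w.start, w.end, w.word) for s in segments for w in s.words] and [(t.start, t.end, spk) for t, _, spk in diarization.itertracks(yield_label=True)]. Overlapping speech and very short turns are where this heuristic tends to break down.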
Whisper.cpp for CPU Inference
For environments without GPU access, whisper.cpp runs inference entirely on the CPU. Large v3 with Q5_0 quantization runs roughly 10x faster than real time on an M2 MacBook Pro (1 hour of audio in about 6 minutes), though the exact figure depends on thread count and build options. Python bindings for whisper.cpp make integration straightforward.
Batch Transcription for Long Audio
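Whisper processes audio in 30-second windows, so anything longer has to be split before it reaches the model. If you chunk manually, for example to spread files over 30 minutes across several workers, use 25-30 second segments with a 2-second overlap to avoid cutting mid-word, and deduplicate the overlapping text when merging. Badly placed cuts are a common cause of looping or hallucinated output. faster-whisper slides the 30-second window over the file automatically, and its transcribe() method accepts a chunk_length parameter if you need a different segment length.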
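If you do chunk manually, the boundaries can be computed up front. This is a minimal sketch of fixed-length chunking with overlap; chunk_bounds is a hypothetical helper, and the 30-second and 2-second defaults simply mirror the numbers above. It only computes time offsets and leaves the actual audio slicing to whatever audio library you use.

```python
from typing import List, Tuple


def chunk_bounds(duration_s: float, chunk_s: float = 30.0,
                 overlap_s: float = 2.0) -> List[Tuple[float, float]]:
    """Return (start, end) offsets in seconds covering the whole file.

    Consecutive chunks share `overlap_s` seconds so that a word cut at
    the end of one chunk appears whole at the start of the next.
    """
    if chunk_s <= overlap_s:
        raise ValueError("chunk_s must be larger than overlap_s")

    step = chunk_s - overlap_s  # how far each new chunk advances
    bounds = []
    start = 0.0
    while start < duration_s:
        end = min(start + chunk_s, duration_s)
        bounds.append((start, end))
        if end >= duration_s:
            break  # the last chunk reached the end of the file
        start += step
    return bounds


if __name__ == "__main__":
    for start, end in chunk_bounds(70.0):
        print(f"{start:.1f}s -> {end:.1f}s")
```

Cutting at silences found by a voice activity detector is usually more robust than fixed offsets; the fixed grid here is only the simplest starting point.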
Production throughput on a single A10G GPU: approximately 300 minutes of audio per hour with Large v3 at float16.
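That throughput figure translates directly into capacity planning. The sketch below is plain arithmetic under the assumption that throughput scales linearly with the number of GPUs and stays at 300 minutes of audio per GPU-hour; both function names are hypothetical, and real pipelines lose some time to loading, decoding, and I/O.

```python
import math


def gpu_hours_needed(audio_minutes: float,
                     throughput_min_per_hour: float = 300.0) -> float:
    """GPU-hours required to transcribe a backlog on one GPU."""
    return audio_minutes / throughput_min_per_hour


def gpus_for_deadline(audio_minutes: float, deadline_hours: float,
                      throughput_min_per_hour: float = 300.0) -> int:
    """Smallest number of GPUs that finishes within the deadline,
    assuming perfectly linear scaling across GPUs."""
    hours = gpu_hours_needed(audio_minutes, throughput_min_per_hour)
    return math.ceil(hours / deadline_hours)


if __name__ == "__main__":
    backlog_minutes = 100 * 60  # 100 hours of recordings
    # 300 audio minutes per wall-clock hour is 5x faster than real time
    print("Speed vs real time:", 300.0 / 60.0)
    print("GPU-hours:", gpu_hours_needed(backlog_minutes))
    print("GPUs for an 8-hour window:", gpus_for_deadline(backlog_minutes, 8.0))
```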