OpenAI Whisper is the best open-source speech-to-text model available as of 2026. Released by OpenAI in 2022 and improved in later versions, Whisper achieves near-human accuracy on English speech and competitive accuracy on close to 100 other languages. The original implementation is slow, but faster-whisper (built on CTranslate2) runs the same models 4-10x faster with lower memory usage. For free, high-speed cloud transcription, Groq offers a Whisper API endpoint that processes audio many times faster than real time at no cost, within rate limits. For production transcription at scale, the choice between local faster-whisper and a cloud API depends primarily on your audio volume and latency requirements.
Whisper Model Sizes
Whisper comes in five sizes with different quality and speed trade-offs:
| Model | Parameters | VRAM | Relative speed | WER (English) |
|-------|-----------|------|----------------|---------------|
| tiny | 39M | ~1GB | ~32x | ~5.7% |
| base | 74M | ~1GB | ~16x | ~4.7% |
| small | 244M | ~2GB | ~6x | ~3.4% |
| medium | 769M | ~5GB | ~2x | ~3.0% |
| large-v3 | 1550M | ~10GB | 1x | ~2.7% |
WER = Word Error Rate on English (lower is better). The speed figures match the relative speeds OpenAI publishes for the original implementation, with the large model as the 1x baseline; they are not multiples of real time. For most transcription use cases, medium or large-v3 is the right choice. For real-time applications on limited hardware, small is the practical option.
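As a rough illustration of how the table can drive model selection, the sketch below picks the largest model whose approximate VRAM requirement fits a given budget. The function name and the dictionary are hypothetical helpers written for this example; the VRAM figures are the approximate values from the table, so treat the result as a starting point rather than a guarantee.

```python
from typing import Optional

# Approximate VRAM needs in GB, copied from the table above.
# Ordered from smallest to largest model.
MODEL_VRAM_GB = {
    "tiny": 1,
    "base": 1,
    "small": 2,
    "medium": 5,
    "large-v3": 10,
}


def largest_model_that_fits(available_vram_gb: float) -> Optional[str]:
    """Return the largest model whose approximate VRAM need fits, or None."""
    best = None
    # Dicts preserve insertion order, so later matches are larger models.
    for name, needed_gb in MODEL_VRAM_GB.items():
        if needed_gb <= available_vram_gb:
            best = name
    return best


print(largest_model_that_fits(8))    # medium
print(largest_model_that_fits(0.5))  # None
```

Quantized compute types in faster-whisper lower the real memory footprint, so a model that looks too large here may still fit in practice.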
Running Locally with faster-whisper
faster-whisper is a reimplementation of Whisper using CTranslate2, an optimized inference engine. It is 4-10x faster than the original Whisper and uses less memory with the same accuracy.
Installation:
pip install faster-whisper
Basic transcription:
from faster_whisper import WhisperModel
# Load model (first run downloads it from HuggingFace)
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
# Transcribe
segments, info = model.transcribe("audio.mp3", beam_size=5)
print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
For CPU-only environments:
model = WhisperModel("medium", device="cpu", compute_type="int8")
The compute_type="int8" setting stores weights as 8-bit integers instead of 32-bit floats, which cuts memory usage substantially with minimal accuracy loss on CPU.
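Each segment yielded by transcribe carries start, end, and text, which is enough to write a caption file. The sketch below formats such segments as SRT. The Seg dataclass is a hypothetical stand-in so the example runs without a model; in real use you would pass the segments from model.transcribe directly, assuming they expose the same three attributes as in the loop above.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Seg:
    """Hypothetical stand-in for a transcription segment (start, end, text)."""
    start: float
    end: float
    text: str


def srt_timestamp(seconds: float) -> str:
    """Format seconds as HH:MM:SS,mmm, the timestamp style SRT uses."""
    total_ms = int(round(seconds * 1000))
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: Iterable) -> str:
    """Build SRT text: numbered cues separated by blank lines."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n"
            f"{srt_timestamp(seg.start)} --> {srt_timestamp(seg.end)}\n"
            f"{seg.text.strip()}\n"
        )
    return "\n".join(blocks)


demo = [Seg(0.0, 2.5, " Hello world."), Seg(2.5, 4.0, " Bye.")]
print(segments_to_srt(demo))
```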
Using via Groq API (Free and Fast)
Groq offers Whisper-large-v3 via their API, running on their LPU (Language Processing Unit) hardware. It is extremely fast (transcribes 1 hour of audio in ~5 seconds) and free within their rate limits (2,000 minutes of audio per day on the free tier as of early 2026).
from groq import Groq
# The client reads the API key from the GROQ_API_KEY environment variable
client = Groq()
with open("audio.mp3", "rb") as file:
    transcription = client.audio.transcriptions.create(
        file=("audio.mp3", file.read()),  # (filename, bytes) tuple
        model="whisper-large-v3",
        language="en",
        response_format="verbose_json",  # needed for timestamp_granularities
        timestamp_granularities=["word", "segment"],
    )
print(transcription.text)
Groq Whisper returns word-level timestamps when requested, which is useful for applications that need to align transcription with video or highlight specific moments.
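To make the "highlight specific moments" idea concrete, here is a minimal sketch that looks up when a keyword was spoken. It assumes the word-level output has been converted to a list of dictionaries with word, start, and end keys; that shape and the sample data are assumptions for illustration, so check the actual response object before relying on it.

```python
import string


def find_keyword_times(words, keyword):
    """Return (start, end) pairs for each occurrence of keyword.

    Matching ignores case and surrounding punctuation.
    """
    target = keyword.lower()
    hits = []
    for w in words:
        token = w["word"].strip().strip(string.punctuation).lower()
        if token == target:
            hits.append((w["start"], w["end"]))
    return hits


# Hand-written sample in the assumed shape, not real API output.
sample_words = [
    {"word": "The", "start": 0.0, "end": 0.2},
    {"word": "budget", "start": 0.2, "end": 0.7},
    {"word": "is", "start": 0.7, "end": 0.8},
    {"word": "approved.", "start": 0.8, "end": 1.4},
    {"word": "Budget,", "start": 3.1, "end": 3.6},
    {"word": "again", "start": 3.6, "end": 4.0},
]

print(find_keyword_times(sample_words, "budget"))  # [(0.2, 0.7), (3.1, 3.6)]
```

The returned time ranges can be used to seek a video player or to mark highlights on a timeline.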
Integration in Node.js
For Node.js applications, the options are:
Via Groq SDK:
import Groq from "groq-sdk";
import fs from "fs";
const groq = new Groq();
const transcription = await groq.audio.transcriptions.create({
  file: fs.createReadStream("audio.mp3"),
  model: "whisper-large-v3",
  language: "en",
});
console.log(transcription.text);
Via OpenAI SDK (OpenAI also offers Whisper as an API):
import OpenAI from "openai";
import fs from "fs";
const openai = new OpenAI();
const transcription = await openai.audio.transcriptions.create({
  file: fs.createReadStream("audio.mp3"),
  model: "whisper-1",
});
OpenAI's Whisper API costs $0.006/minute. Groq's is free within rate limits. For volume above Groq's free tier, running faster-whisper locally is cheaper than OpenAI's API.
Accuracy by Language and Audio Quality
Whisper-large-v3 English accuracy is excellent for clear audio: 2.7% WER, which means about 1 error per 37 words. In practice, this is better than most human transcriptionists for standard speech.
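WER is conventionally computed as (substitutions + deletions + insertions) divided by the number of words in the reference transcript. The sketch below implements that with a word-level edit distance, which is handy for measuring accuracy on your own audio. It is a simplified version: it lowercases and splits on whitespace, whereas published benchmarks typically apply more careful text normalization first.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```

A WER of 0.027 corresponds to the 2.7% figure above: 1 / 0.027 is roughly 37 words per error.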
Accuracy degrades with: heavy accents (though it handles most accents well), technical jargon not present in training data, noisy audio (background music, crowd noise, recording artifacts), and overlapping speakers.
For non-English languages, large-v3 handles French, German, Spanish, Japanese, Chinese, and other major languages well. Accuracy drops significantly for low-resource languages.
Use Cases
Meeting transcription: Record meetings and transcribe with speaker diarization (requires additional tools like pyannote for speaker detection). This is the primary use case for Zlyqor's AI meeting summaries feature.
Voice notes: Transcribe voice memos for search and organization.
Video captioning: Generate captions for video content. Use word-level timestamps to synchronize captions with video.
Call center analytics: Transcribe support calls for quality analysis and intent detection.
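For the meeting transcription case, one common way to combine a transcript with diarization output is to give each transcript segment the speaker whose turn overlaps it the most. The sketch below shows that merge step only. The tuple shapes, the function names, and the SPEAKER_00 style labels are illustrative assumptions; adapt them to whatever your diarization tool actually returns.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time ranges (0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def assign_speakers(segments, turns):
    """Label each transcript segment with the most-overlapping speaker.

    segments: list of (start, end, text) from the transcription step
    turns:    list of (start, end, speaker) from the diarization step
    """
    labeled = []
    for s_start, s_end, text in segments:
        best_speaker, best_overlap = "unknown", 0.0
        for t_start, t_end, speaker in turns:
            shared = overlap(s_start, s_end, t_start, t_end)
            if shared > best_overlap:
                best_speaker, best_overlap = speaker, shared
        labeled.append((best_speaker, text))
    return labeled


# Hand-written sample data for illustration.
segments = [
    (0.0, 4.0, "Hi all."),
    (4.0, 9.0, "Thanks, let's start."),
    (20.0, 22.0, "Bye"),
]
turns = [(0.0, 4.5, "SPEAKER_00"), (4.5, 10.0, "SPEAKER_01")]

for speaker, text in assign_speakers(segments, turns):
    print(f"{speaker}: {text}")
```

Segments with no overlapping turn are labeled "unknown" rather than guessed, which makes gaps in the diarization easy to spot. Overlapping speakers remain a weak point, since this approach assigns one speaker per segment.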
Cost Comparison
At 10,000 minutes of audio per month:
- OpenAI Whisper API: $60/month ($0.006/min)
- Groq (above free tier): ~$18.50/month ($0.111 per hour of audio, roughly $0.00003 per audio-second)
- Local faster-whisper on g4dn.xlarge (T4 GPU, AWS): ~$120/month for an always-on server. Because 10,000 minutes of audio transcribes in a matter of hours, batch processing on spot instances at ~$0.12/hour works out to $5-15/month.
For batch transcription workloads, local faster-whisper on spot GPU instances is significantly cheaper than API options. For real-time or latency-sensitive applications, Groq is the best price-to-performance option.
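The arithmetic behind this comparison can be sketched as follows, so you can plug in your own volume and prices. The API price and the spot price come from the figures above; the realtime_factor (how many minutes of audio the GPU transcribes per minute of compute) is an assumed parameter, not a measurement, and it dominates the result.

```python
def api_cost(audio_minutes: float, price_per_minute: float) -> float:
    """Monthly cost of a per-minute priced transcription API."""
    return audio_minutes * price_per_minute


def batch_gpu_cost(
    audio_minutes: float,
    realtime_factor: float,
    price_per_gpu_hour: float,
) -> float:
    """Monthly cost of batch transcription on a rented GPU.

    realtime_factor is an assumption: 4 means one hour of audio
    takes 15 minutes of GPU time.
    """
    gpu_hours = (audio_minutes / 60) / realtime_factor
    return gpu_hours * price_per_gpu_hour


MINUTES_PER_MONTH = 10_000

# OpenAI API at $0.006 per minute
print(round(api_cost(MINUTES_PER_MONTH, 0.006), 2))            # 60.0
# Spot GPU at ~$0.12/hour, assuming 4x real-time throughput
print(round(batch_gpu_cost(MINUTES_PER_MONTH, 4, 0.12), 2))    # 5.0
```

This ignores storage, data transfer, and the engineering time needed to run your own batch pipeline, all of which narrow the gap at low volumes.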