Whisper is the best open source speech-to-text model. Run locally with faster-whisper or free via Groq. Here is how to integrate transcription in Python and Node.js with honest tradeoffs.
OpenAI Whisper is the best open source speech-to-text model available as of 2026. Released by OpenAI in 2022 and continuously improved, Whisper achieves near-human accuracy on English speech and competitive accuracy on 99+ other languages. The original implementation is slow, but faster-whisper (based on CTranslate2) runs the same models 4-10x faster with lower memory usage. For free, high-speed cloud transcription, Groq offers a Whisper API endpoint that processes audio faster than real-time at no cost (with rate limits). For production transcription at scale, the choice between local faster-whisper and cloud APIs depends primarily on your audio volume and latency requirements.
Whisper Model Sizes
Whisper comes in five sizes with different quality and speed trade-offs:
| Model | Parameters | VRAM | Relative speed | WER (English) |
|---|---|---|---|---|
| tiny | 39M | ~1GB | ~32x | ~5.7% |
| base | 74M | ~1GB | ~16x | ~4.7% |
| small | 244M | ~2GB | ~6x | ~3.4% |
| medium | 769M | ~5GB | ~2x | ~3.0% |
| large-v3 | 1550M | ~10GB | ~1x | ~2.7% |
WER = Word Error Rate on English (lower is better). The speed figures are relative to the largest model (1x) rather than to real time, so actual throughput depends on your hardware. For most transcription use cases, medium or large-v3 is the right choice. For real-time applications on limited hardware, small is the practical option.
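To make the WER column concrete, here is a minimal sketch of how word error rate is usually computed: the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The function name `word_error_rate` and the simple lowercase-and-split normalization are assumptions for illustration; published benchmarks apply more careful text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Measuring WER this way on a small sample of your own audio is a more reliable guide to model choice than the headline numbers in the table.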
Running Locally with faster-whisper
faster-whisper is a reimplementation of Whisper using CTranslate2, an optimized inference engine. It is 4-10x faster than the original Whisper and uses less memory with the same accuracy.
Installation:
pip install faster-whisper
Basic transcription:
from faster_whisper import WhisperModel
# Load model (first run downloads it from HuggingFace)
model = WhisperModel("large-v3", device="cuda", compute_type="float16")
# Transcribe (segments is a lazy generator: decoding runs as you iterate)
segments, info = model.transcribe("audio.mp3", beam_size=5)
print(f"Detected language: {info.language} (probability: {info.language_probability:.2f})")
for segment in segments:
    print(f"[{segment.start:.2f}s -> {segment.end:.2f}s] {segment.text}")
For CPU-only environments:
model = WhisperModel("medium", device="cpu", compute_type="int8")
Setting compute_type="int8" quantizes the weights to 8-bit integers, which takes roughly half the memory of float16 with minimal accuracy loss on CPU.
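The two configurations above can be folded into one small helper. This is a sketch, not part of faster-whisper: `pick_whisper_config` and `load_model` are hypothetical names, and the model and compute-type choices simply mirror the two examples in this section.

```python
def pick_whisper_config(has_gpu: bool) -> dict:
    """Return WhisperModel keyword arguments mirroring the GPU and CPU setups above."""
    if has_gpu:
        return {"model_size_or_path": "large-v3", "device": "cuda", "compute_type": "float16"}
    return {"model_size_or_path": "medium", "device": "cpu", "compute_type": "int8"}


def load_model(has_gpu: bool):
    """Load a faster-whisper model using the configuration chosen above."""
    # Imported lazily so the helper above can be used without faster-whisper installed.
    from faster_whisper import WhisperModel

    return WhisperModel(**pick_whisper_config(has_gpu))


if __name__ == "__main__":
    print(pick_whisper_config(has_gpu=False))
```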
Written by Mahmudul Haque Qudrati, CEO and ML Engineer at Pristren, who builds AI-powered software for teams and writes about machine learning, LLMs, developer tools, and practical AI applications.
Cloud Transcription with Groq
Groq offers Whisper-large-v3 through its API, running on its LPU (Language Processing Unit) hardware. It is extremely fast (about 1 hour of audio transcribed in ~5 seconds) and free within its rate limits (2,000 minutes of audio per day on the free tier as of early 2026; check the current limits before depending on this figure).
from groq import Groq
# The client reads the API key from the GROQ_API_KEY environment variable
client = Groq()
with open("audio.mp3", "rb") as file:
    transcription = client.audio.transcriptions.create(
        file=("audio.mp3", file.read()),  # (filename, raw bytes)
        model="whisper-large-v3",
        language="en",
        response_format="verbose_json",  # needed for timestamps
        timestamp_granularities=["word", "segment"],
    )
print(transcription.text)
Groq Whisper returns word-level timestamps when requested, which is useful for applications that need to align transcription with video or highlight specific moments.
Integration in Node.js
For Node.js applications, the options are:
Via Groq SDK:
import Groq from "groq-sdk";
import fs from "fs";
const groq = new Groq();
const transcription = await groq.audio.transcriptions.create({
  file: fs.createReadStream("audio.mp3"),
  model: "whisper-large-v3",
  language: "en",
});
console.log(transcription.text);
Via OpenAI SDK (OpenAI also offers Whisper as an API):
import OpenAI from "openai";
import fs from "fs";
const openai = new OpenAI();
const transcription = await openai.audio.transcriptions.create({
  file: fs.createReadStream("audio.mp3"),
  model: "whisper-1",
});
OpenAI's Whisper API costs $0.006/minute. Groq's is free within rate limits. For volume above Groq's free tier, running faster-whisper locally is cheaper than OpenAI's API.
Accuracy by Language and Audio Quality
Whisper-large-v3 English accuracy is excellent for clear audio: 2.7% WER, or about 1 error per 37 words (1 / 0.027 ≈ 37). For standard speech, that is comparable to, and often better than, a typical human transcriptionist.
Accuracy degrades with: heavy accents (though it handles most accents well), technical jargon not present in training data, noisy audio (background music, crowd noise, recording artifacts), and overlapping speakers.
For non-English languages, large-v3 handles French, German, Spanish, Japanese, Chinese, and other major languages well. Accuracy drops significantly for low-resource languages.
Use Cases
Meeting transcription: Record meetings and transcribe with speaker diarization (requires additional tools like pyannote for speaker detection). This is the primary use case for Zlyqor's AI meeting summaries feature.
Voice notes: Transcribe voice memos for search and organization.
Video captioning: Generate captions for video content. Use word-level timestamps to synchronize captions with video.
Call center analytics: Transcribe support calls for quality analysis and intent detection.
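For the video captioning case, segment timestamps map directly onto the SubRip (SRT) caption format. The sketch below assumes segments have already been reduced to `(start, end, text)` tuples, for example from the `segment.start`, `segment.end`, and `segment.text` fields shown earlier; the helper names `srt_timestamp` and `to_srt` are hypothetical.

```python
def srt_timestamp(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)
    hours, total_ms = divmod(total_ms, 3_600_000)
    minutes, total_ms = divmod(total_ms, 60_000)
    secs, millis = divmod(total_ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def to_srt(segments) -> str:
    """Build SRT text from an iterable of (start, end, text) tuples."""
    blocks = []
    for index, (start, end, text) in enumerate(segments, start=1):
        blocks.append(
            f"{index}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text.strip()}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(blocks)


if __name__ == "__main__":
    print(to_srt([(0.0, 2.5, " Hello and welcome."), (2.5, 5.0, " Let's get started.")]))
```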
Cost Comparison
Local faster-whisper on a g4dn.xlarge (T4 GPU, AWS): about $120/month if the server is always on. A 10,000-minute batch finishes in a matter of hours, however, so running it on spot instances at ~$0.12/hour works out to roughly $5-15/month.
For batch transcription workloads, local faster-whisper on spot GPU instances is significantly cheaper than API options. For real-time or latency-sensitive applications, Groq is the best price-to-performance option.
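The comparison can be reproduced with a few lines of arithmetic. This is a rough sketch using the prices quoted in this article ($0.006/minute for OpenAI's API, ~$0.12/hour for a spot GPU). The `realtime_factor` parameter, meaning how many minutes of audio the GPU transcribes per minute of wall-clock time, is an assumption you should measure on your own hardware.

```python
OPENAI_PRICE_PER_MINUTE = 0.006  # USD per audio minute, as quoted above
SPOT_PRICE_PER_HOUR = 0.12       # USD per GPU hour, as quoted above


def openai_monthly_cost(audio_minutes: float) -> float:
    """Cost of sending all audio to OpenAI's hosted Whisper API."""
    return audio_minutes * OPENAI_PRICE_PER_MINUTE


def spot_gpu_monthly_cost(audio_minutes: float, realtime_factor: float,
                          hourly_rate: float = SPOT_PRICE_PER_HOUR) -> float:
    """Cost of batch transcription on a spot GPU billed only while it runs.

    realtime_factor is an assumed throughput: audio minutes processed per
    wall-clock minute. It is not a measured value.
    """
    gpu_hours = audio_minutes / 60 / realtime_factor
    return gpu_hours * hourly_rate


if __name__ == "__main__":
    minutes = 10_000
    print(f"OpenAI API: ${openai_monthly_cost(minutes):.2f}")
    for factor in (2, 4, 8):
        cost = spot_gpu_monthly_cost(minutes, factor)
        print(f"Spot GPU at {factor}x real time: ${cost:.2f}")
```

Storage, data transfer, and engineering time are left out, so treat the spot figure as a lower bound.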
Best Practices for Production
Audio preprocessing: Resample to 16kHz mono WAV for best results. Whisper expects a 16kHz sampling rate.
Chunking: For long audio, split into 30-second chunks with overlap to avoid truncation. faster-whisper handles this automatically.
Language detection: Use Whisper's built-in language detection for multilingual audio, but specify the language when it is known for better accuracy.
Prompting: Provide a prompt with domain-specific terms to improve accuracy on jargon (e.g., "The following is a medical transcription" for medical audio).
Error handling: Implement retries with exponential backoff for API calls, and fall back to a local model if the API is unavailable.
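The error-handling advice can be sketched as a small wrapper. Everything here is illustrative: `transcribe_with_fallback` is a hypothetical helper, and `api_transcribe` and `local_transcribe` stand in for whatever functions wrap your cloud call and your local faster-whisper model. Each takes a file path and returns text.

```python
import random
import time


def transcribe_with_fallback(path, api_transcribe, local_transcribe,
                             max_attempts=4, base_delay=1.0, sleep=time.sleep):
    """Try the API with exponential backoff, then fall back to the local model."""
    for attempt in range(max_attempts):
        try:
            return api_transcribe(path)
        except Exception:
            # Sketch only: production code should catch just the transient
            # errors (timeouts, rate limits) raised by the SDK in use.
            if attempt == max_attempts - 1:
                break
            # Delay doubles each attempt; jitter avoids synchronized retries.
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
    return local_transcribe(path)


if __name__ == "__main__":
    def unavailable_api(path):
        raise ConnectionError("API unavailable")

    def local_model(path):
        return f"(local transcript of {path})"

    print(transcribe_with_fallback("audio.mp3", unavailable_api, local_model,
                                   base_delay=0.01))
```

The `sleep` parameter exists so the wait can be replaced in tests; in normal use the default `time.sleep` applies.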
Frequently Asked Questions
What is OpenAI Whisper?
OpenAI Whisper is an open source speech-to-text model released by OpenAI in 2022. It achieves near-human accuracy on English speech and supports 99+ languages. This guide covers model sizes, local deployment with faster-whisper, cloud API usage via Groq (free) and OpenAI (paid), integration in Python and Node.js, accuracy considerations, use cases, and cost comparisons for developers.
How does OpenAI Whisper work?
Whisper uses a transformer-based encoder-decoder architecture trained on 680,000 hours of multilingual audio. It processes audio in 30-second chunks, converts them to log-Mel spectrograms, and generates text transcripts with timestamps. The guide explains how to run it locally using faster-whisper (optimized CTranslate2 backend) or via cloud APIs like Groq and OpenAI.
What are the best practices for using OpenAI Whisper?
Best practices include: resampling audio to 16kHz mono WAV, using appropriate model size (medium or large-v3 for accuracy, small for speed), enabling int8 quantization on CPU, providing domain-specific prompts to reduce jargon errors, implementing retry logic for API calls, and chunking long audio with overlap. For production, consider using spot GPU instances for batch processing to minimize cost.
How much does OpenAI Whisper cost?
Costs vary by deployment: OpenAI Whisper API charges $0.006 per minute of audio. Groq offers free transcription within rate limits (2,000 minutes/day as of 2026). Running faster-whisper locally on a GPU instance (e.g., AWS g4dn.xlarge spot) costs about $5-15/month for 10,000 minutes of batch processing. The guide provides a detailed cost comparison to help developers choose the most economical option.
Is OpenAI Whisper worth it in 2026?
Yes, Whisper remains the best open source speech-to-text model in 2026 due to its high accuracy (2.7% WER on English), multilingual support, and multiple deployment options. The guide helps developers evaluate tradeoffs: local faster-whisper for cost-effective batch processing, Groq for free real-time transcription, and OpenAI API for ease of use. It's worth it for any developer needing reliable transcription without vendor lock-in.