Self-Supervised Pre-Training
The fundamental problem with ASR for low-resource languages is that collecting transcribed speech is expensive. Wav2Vec 2.0 addresses this by splitting training into two stages: pre-training, which needs only unlabeled audio, and fine-tuning, which needs only a small labeled set.
Pre-training works through contrastive learning:
- Raw audio is processed through a convolutional feature extractor
- Features are quantized into discrete speech units drawn from learned codebooks
- Spans of the feature sequence are masked, and a transformer context network builds contextualized representations over the whole sequence
- For each masked time step, the model learns to pick the true quantized unit out of a set of distractors
This self-supervised objective learns phonetic and acoustic structure from unlabeled speech: recordings of any kind will do, and no transcripts are needed.
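The contrastive objective described above can be sketched in a few lines of plain Python. This is a toy illustration, not the model's training code: the function names, the three-dimensional vectors, and the temperature value are assumptions made for the example, and the real objective also includes a codebook diversity term that is left out here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(context, target, distractors, temperature=0.1):
    """Negative log-probability of the true target among target + distractors.

    `context` stands in for the transformer output at a masked time step,
    `target` for the true quantized unit, `distractors` for wrong candidates.
    """
    candidates = [target] + list(distractors)
    logits = [cosine(context, q) / temperature for q in candidates]
    # Numerically stable log-sum-exp over all candidates
    max_logit = max(logits)
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[0]


# Hypothetical toy vectors: the context points almost exactly at the target
context = [0.9, 0.1, 0.0]
target = [1.0, 0.0, 0.0]
distractors = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

loss_correct = contrastive_loss(context, target, distractors)
# Swap roles: treat a distractor as the "true" unit to see the loss rise
loss_wrong = contrastive_loss(context, distractors[0], [target, distractors[1]])
print(loss_correct, loss_wrong)
```

The loss is small when the context vector is closest to the true unit and grows when a distractor is closer, which is the signal that pushes the network toward useful speech representations.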
Fine-Tuning With 10 Minutes of Labeled Data
After pre-training on unlabeled audio, fine-tuning with CTC (Connectionist Temporal Classification) needs remarkably little labeled data. The fine-tuning guide demonstrates fine-tuning on 960 hours of English, but the Wav2Vec 2.0 authors report usable word error rates (WER) with as little as 10 minutes of labeled speech, which is what makes the approach attractive for low-resource languages.
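Since WER is the metric quoted throughout, here is a minimal sketch of how it is usually computed: word-level edit distance divided by the number of reference words. The function name and the example sentences are made up for illustration. Evaluation libraries typically also normalize casing and punctuation first, which this sketch skips.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 words
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer:.3f}")
```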
import librosa
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
# This checkpoint is already fine-tuned for English, so the snippet runs inference only
processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h-lv60-self")
# Load audio and resample to 16 kHz, the rate the model expects
audio, sr = librosa.load("speech.wav", sr=16000)
input_values = processor(audio, sampling_rate=sr, return_tensors="pt").input_values
# No gradients are needed for inference
with torch.no_grad():
    logits = model(input_values).logits  # shape: (batch, frames, vocab)
# Greedy decoding: take the best token per frame, then let the tokenizer collapse repeats and blanks
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)
print(transcription[0])
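The HuggingFace Wav2Vec2 page hosts the 960h-lv60-self model used above: a large checkpoint pre-trained on Libri-Light, fine-tuned on 960 hours of LibriSpeech, and further improved with self-training, which makes it one of the strongest English variants.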
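The argmax-then-decode step in the snippet above relies on CTC's collapsing rule: merge consecutive repeated tokens, then drop the blank token. A minimal sketch of that rule follows. The six-token vocabulary and the frame IDs are hypothetical; the only detail borrowed from the real tokenizer is that the pad token acts as the CTC blank and "|" marks word boundaries.

```python
from itertools import groupby


def ctc_greedy_collapse(ids, id_to_token, blank_id=0, word_delimiter="|"):
    """Turn per-frame token IDs into text using the CTC collapsing rule."""
    # 1) Merge runs of identical consecutive IDs into a single ID
    merged = [token_id for token_id, _ in groupby(ids)]
    # 2) Drop blanks; a blank between two identical IDs keeps a double letter
    chars = [id_to_token[i] for i in merged if i != blank_id]
    # 3) Map the word delimiter to a space
    return "".join(chars).replace(word_delimiter, " ").strip()


# Hypothetical toy vocabulary and frame-level predictions
id_to_token = {0: "<pad>", 1: "|", 2: "H", 3: "E", 4: "L", 5: "O"}
frame_ids = [2, 2, 0, 3, 4, 4, 0, 4, 5, 5, 1, 0]

text = ctc_greedy_collapse(frame_ids, id_to_token)
print(text)
```

The blank between the two runs of "L" is what allows the double letter to survive the merge step.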
CTC Decoder With Language Model Integration
Greedy CTC decoding picks the most likely character at each timestep independently, so nothing constrains the output to form real words. Adding an n-gram language model (KenLM) typically reduces WER by incorporating word-level probabilities:
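from pyctcdecode import build_ctcdecoder
# Labels must be ordered by token ID so each logit column maps to the right character
vocab_dict = processor.tokenizer.get_vocab()
labels = [token for token, _ in sorted(vocab_dict.items(), key=lambda item: item[1])]
# build_ctcdecoder takes a path to the KenLM file (the kenlm package must be installed).
# The LM's casing must match the labels; this checkpoint's vocabulary is uppercase.
decoder = build_ctcdecoder(
    labels,
    kenlm_model_path="language_model.arpa",
    alpha=0.5,  # LM weight
    beta=1.5,  # Word insertion bonus
)
# Decode the first utterance's logits (frames x vocab) with the LM
logits_np = logits[0].cpu().numpy()
transcription_lm = decoder.decode(logits_np)
print(transcription_lm)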
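To see what the alpha and beta weights do, here is a simplified sketch of the scoring idea behind LM-fused decoding: the acoustic score is combined with a weighted LM score plus a per-word bonus. The hypotheses and all log-probabilities are invented for illustration, and a real beam search applies this kind of scoring incrementally while expanding beams, so its details differ.

```python
def fused_score(acoustic_logp, lm_logp, num_words, alpha=0.5, beta=1.5):
    """Combine acoustic and LM evidence for one candidate transcription."""
    return acoustic_logp + alpha * lm_logp + beta * num_words


def best_hypothesis(hypotheses, alpha, beta):
    """Return the text of the highest-scoring (text, acoustic, lm) candidate."""
    return max(
        hypotheses,
        key=lambda h: fused_score(h[1], h[2], len(h[0].split()), alpha, beta),
    )[0]


# (text, acoustic log-prob, LM log-prob): made-up numbers for illustration
hypotheses = [
    ("recognize speech", -4.2, -6.0),
    ("wreck a nice beach", -4.0, -14.0),
]

# With the LM switched off, the acoustically likelier candidate wins
without_lm = best_hypothesis(hypotheses, alpha=0.0, beta=0.0)
# With the LM on, the more plausible word sequence overtakes it
with_lm = best_hypothesis(hypotheses, alpha=0.5, beta=1.5)
print(without_lm, "->", with_lm)
```

In practice alpha and beta are tuned on a held-out set, since the best balance depends on how well the LM matches the target domain.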
XLSR-53: Multilingual Variant
XLSR-53 is pre-trained on unlabeled speech in 53 languages at once, which enables cross-lingual transfer. When you fine-tune it on a low-resource language, it can draw on phonetic knowledge learned from related languages. This is particularly effective within language families: fine-tuning for Catalan, for example, benefits from the Spanish data seen during pre-training.
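A minimal sketch of loading XLSR-53 for CTC fine-tuning with the transformers library might look like the following. The helper names and the example vocabulary size are hypothetical, and the settings are a common starting point, not tuned values. The import sits inside the function so that defining it does not trigger a download of the model weights.

```python
def ctc_finetune_kwargs(vocab_size: int, pad_token_id: int) -> dict:
    """Config overrides for a fresh CTC head sized to the target language."""
    return {
        "vocab_size": vocab_size,  # size of the new character vocabulary
        "pad_token_id": pad_token_id,  # the pad token doubles as the CTC blank
        "ctc_loss_reduction": "mean",  # average the loss over the batch
    }


def load_xlsr_for_ctc(vocab_size: int, pad_token_id: int):
    """Load the multilingual checkpoint with a newly initialized CTC head."""
    from transformers import Wav2Vec2ForCTC  # lazy import: calling this downloads weights

    model = Wav2Vec2ForCTC.from_pretrained(
        "facebook/wav2vec2-large-xlsr-53",
        **ctc_finetune_kwargs(vocab_size, pad_token_id),
    )
    # Keep the convolutional feature encoder fixed during fine-tuning
    # (older transformers releases name this freeze_feature_extractor)
    model.freeze_feature_encoder()
    return model


# Hypothetical values: a 40-symbol vocabulary whose last entry is the pad token
kwargs = ctc_finetune_kwargs(vocab_size=40, pad_token_id=39)
print(kwargs)
```

Calling `load_xlsr_for_ctc(40, 39)` returns a model whose output layer is randomly initialized, which is why some labeled data in the target language is still required.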
Custom Dataset Fine-Tuning
Prepare the data as a HuggingFace Dataset with an audio column (sampled at 16 kHz) and a text column. The fine-tuning guide walks through vocabulary building, CTC tokenizer setup, and its DataCollatorCTCWithPadding helper for padding batches. Training a language-specific model from XLSR-53 requires approximately 4-8 GPU hours on an A10G for a 100-hour dataset.
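The vocabulary-building step can be sketched as follows, under the assumption that transcripts are already cleaned and lowercased. The function name and the two Catalan sample transcripts are illustrative; the special token names follow a common convention for CTC tokenizers.

```python
import json


def build_ctc_vocab(transcripts):
    """Map every character seen in the transcripts to an integer ID."""
    chars = sorted(set("".join(transcripts)))
    vocab = {ch: idx for idx, ch in enumerate(chars)}
    # Use a visible word delimiter in place of the space character
    if " " in vocab:
        vocab["|"] = vocab.pop(" ")
    # Unknown-character token, and padding token (also used as the CTC blank)
    vocab["[UNK]"] = len(vocab)
    vocab["[PAD]"] = len(vocab)
    return vocab


# Hypothetical sample transcripts
vocab = build_ctc_vocab(["bon dia", "adeu"])
print(json.dumps(vocab, ensure_ascii=False))
```

Written to a `vocab.json` file, a dictionary like this can be passed to `Wav2Vec2CTCTokenizer` together with `unk_token="[UNK]"`, `pad_token="[PAD]"`, and `word_delimiter_token="|"`.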