MusicGen: Meta's Text-to-Music Model That Runs Locally

Meta's MusicGen generates 30-second music clips from text descriptions or melody conditioning using an EnCodec audio tokenizer and autoregressive transformer - fully open and self-hostable.

Mahmudul Haque Qudrati

CEO & ML Engineer

May 15, 2026

7 min read

// tags

#musicgen #meta #music-generation #audio-ai #encodec

Open Music Generation Without a Subscription

Suno and Udio produce impressive music, but they are closed services with usage limits and licensing ambiguity. MusicGen from Meta's FAIR team is open: the code is MIT-licensed and the pretrained weights are published under CC-BY-NC 4.0, which restricts them to non-commercial use. It is self-hostable on a single GPU and capable of high-quality instrumental generation from text prompts or melody conditioning.

Architecture: EnCodec + Transformer

MusicGen uses a two-stage approach:

EnCodec - Meta's neural audio codec converts raw audio waveforms into discrete tokens across multiple codebooks (4 for the mono models, 8 for the stereo models, which encode the left and right channels separately). It runs at 50 frames per second on 32 kHz audio, so a 30-second clip becomes a sequence of 1500 tokens per codebook.
Transformer decoder - an autoregressive model generates the EnCodec token sequence conditioned on text embeddings (from a frozen T5 encoder). Rather than flattening the codebooks into one long sequence, each decoding step predicts one token for every codebook in parallel, using a "delay pattern" that offsets each codebook by one more step than the previous one.
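The token arithmetic and the delay pattern can be sketched in a few lines of NumPy. This is an illustrative toy, not the code MusicGen itself uses: the function names, the `PAD` value, and the tiny 4 x 6 token grid are assumptions made for the example. It assumes the 50 frames-per-second rate implied by 1500 tokens per 30 seconds.

```python
import numpy as np

FRAME_RATE_HZ = 50  # frames per second per codebook (1500 tokens / 30 s)
PAD = -1            # illustrative placeholder for "no token at this step"


def frames_for(seconds: float, frame_rate: int = FRAME_RATE_HZ) -> int:
    """Number of tokens per codebook needed for a clip of the given length."""
    return int(round(seconds * frame_rate))


def apply_delay_pattern(codes: np.ndarray, pad: int = PAD) -> np.ndarray:
    """Shift codebook k right by k steps.

    codes has shape (num_codebooks, num_frames); the result has shape
    (num_codebooks, num_frames + num_codebooks - 1).
    """
    num_codebooks, num_frames = codes.shape
    steps = num_frames + num_codebooks - 1
    delayed = np.full((num_codebooks, steps), pad, dtype=codes.dtype)
    for k in range(num_codebooks):
        delayed[k, k:k + num_frames] = codes[k]
    return delayed


def undo_delay_pattern(delayed: np.ndarray, num_frames: int) -> np.ndarray:
    """Realign the codebooks so that column t holds all tokens of frame t."""
    rows = [delayed[k, k:k + num_frames] for k in range(delayed.shape[0])]
    return np.stack(rows)


# Toy example: 4 codebooks, 6 frames, token ids 0..23.
codes = np.arange(4 * 6).reshape(4, 6)
delayed = apply_delay_pattern(codes)
restored = undo_delay_pattern(delayed, num_frames=6)
print(frames_for(30))  # tokens per codebook for a 30-second clip
print(delayed)
```

Each column of `delayed` is what one decoding step would emit: codebook 0 is always one frame ahead of codebook 1, and so on, which lets the finer codebooks condition on the coarser ones without multiplying the sequence length by the number of codebooks.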

from transformers import AutoProcessor, MusicgenForConditionalGeneration
import torch
import scipy.io.wavfile

# Load the processor and the model weights in half precision on the GPU.
processor = AutoProcessor.from_pretrained("facebook/musicgen-stereo-large")
model = MusicgenForConditionalGeneration.from_pretrained(
    "facebook/musicgen-stereo-large",
    torch_dtype=torch.float16,
).to("cuda")

# One clip is generated per description; the batch is padded to equal length.
descriptions = [
    "upbeat jazz piano trio, 120 BPM, walking bass, brushed drums, no vocals",
    "ambient electronic, slow evolving pads, 60 BPM, cinematic, minor key",
]

inputs = processor(text=descriptions, padding=True, return_tensors="pt").to("cuda")

# Generate 15 seconds of audio: the decoder emits 50 tokens per second,
# so 750 tokens ≈ 15 s (256 tokens ≈ 5 s).
audio_values = model.generate(**inputs, max_new_tokens=750)

# Save as WAV. Cast to float32 first: the model outputs float16,
# which is not a standard WAV sample format.
sampling_rate = model.config.audio_encoder.sampling_rate  # 32000
for i, audio in enumerate(audio_values.float().cpu().numpy()):
    # audio has shape (channels, samples); WAV files expect (samples, channels).
    scipy.io.wavfile.write(f"output_{i}.wav", rate=sampling_rate, data=audio.T)
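If the WAV files need to open in tools that only handle integer PCM, the generated samples can be converted to 16-bit before writing. The helper below is a minimal sketch: `to_int16_pcm` is a name made up for this example, and it assumes the audio is a float array nominally in the range [-1, 1], rescaling only when samples exceed that range.

```python
import numpy as np


def to_int16_pcm(audio: np.ndarray) -> np.ndarray:
    """Convert float audio in [-1, 1] to 16-bit PCM (hypothetical helper)."""
    audio = np.asarray(audio, dtype=np.float32)
    peak = float(np.max(np.abs(audio))) if audio.size else 0.0
    if peak > 1.0:
        # Rescale only when samples exceed full scale, to avoid clipping.
        audio = audio / peak
    return np.round(audio * 32767.0).astype(np.int16)


# Stand-in for generated audio: one second of a 440 Hz tone at 32 kHz.
sampling_rate = 32000
t = np.arange(sampling_rate) / sampling_rate
tone = 0.5 * np.sin(2 * np.pi * 440.0 * t)
pcm = to_int16_pcm(tone)
print(pcm.dtype, pcm.shape)
```

The result can be passed as `data` to `scipy.io.wavfile.write` in place of the float array; the transpose to (samples, channels) still applies for stereo output.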

Model Sizes and Quality Trade-offs

Variant	Parameters	VRAM (approx.)	Quality
musicgen-small	300M	~4GB	Adequate for prototyping
musicgen-medium	1.5B	~8GB	Good for most use cases
musicgen-large	3.3B	~16GB	High quality
musicgen-stereo-large	3.3B	~16GB	Stereo output, recommended
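The table can be turned into a small lookup for choosing a checkpoint at startup. This is a sketch under assumptions: the VRAM figures are the approximate values from the table, the model names carry the `facebook/` prefix used when loading from the Hugging Face Hub, and `pick_variant` and its `headroom_gb` parameter are hypothetical names introduced for the example.

```python
from typing import Optional

# (checkpoint name, approximate VRAM in GB), ordered smallest to largest.
# Figures are the rough values from the table, not measurements.
VARIANTS = [
    ("facebook/musicgen-small", 4.0),
    ("facebook/musicgen-medium", 8.0),
    ("facebook/musicgen-large", 16.0),
    ("facebook/musicgen-stereo-large", 16.0),
]


def pick_variant(free_vram_gb: float, headroom_gb: float = 1.0) -> Optional[str]:
    """Return the last (largest) variant that fits, or None if none does.

    headroom_gb is an assumed safety margin for activations and other
    processes sharing the GPU.
    """
    best = None
    for name, needed_gb in VARIANTS:
        if needed_gb + headroom_gb <= free_vram_gb:
            best = name
    return best


for vram in (3, 6, 12, 24):
    print(vram, "GB ->", pick_variant(vram))
```

On a machine with PyTorch installed, the free memory could come from `torch.cuda.mem_get_info()`, which returns free and total bytes for the current device.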
