Open Music Generation Without a Subscription
Suno and Udio produce impressive music, but both are closed services with usage limits and licensing ambiguity. MusicGen, from Meta's FAIR team, is openly released: the code is MIT-licensed, the pretrained weights are published under CC-BY-NC 4.0, and the model can be self-hosted on a single GPU. It generates high-quality instrumental music from text prompts or melody conditioning.
Architecture: EnCodec + Transformer
MusicGen uses a two-stage approach:
- EnCodec: Meta's neural audio codec converts raw waveforms into discrete tokens spread across several parallel codebooks (4 in MusicGen's 32 kHz codec, each refining the approximation of the previous one). The codec emits 50 token frames per second, so a 30-second clip becomes about 1,500 tokens per codebook.
- Transformer decoder: an autoregressive model generates the EnCodec token sequence conditioned on text embeddings from a frozen T5 encoder. Rather than flattening the codebooks into one long sequence, it predicts one token per codebook at every decoding step, using a "delay pattern" that offsets each codebook by one step relative to the previous one.
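The delay pattern is easier to see on a toy grid. The following is a minimal sketch, assuming 4 codebooks and a one-step offset per codebook; the `PAD` value and the helper names are illustrative and are not MusicGen's internal API.

```python
import numpy as np

PAD = -1  # illustrative placeholder for positions that hold no token yet


def apply_delay_pattern(codes: np.ndarray) -> np.ndarray:
    """Shift codebook k right by k steps.

    codes has shape (num_codebooks, num_frames); the result has shape
    (num_codebooks, num_frames + num_codebooks - 1).
    """
    num_codebooks, num_frames = codes.shape
    delayed = np.full(
        (num_codebooks, num_frames + num_codebooks - 1), PAD, dtype=codes.dtype
    )
    for k in range(num_codebooks):
        delayed[k, k:k + num_frames] = codes[k]
    return delayed


def undo_delay_pattern(delayed: np.ndarray) -> np.ndarray:
    """Realign the codebooks so column t again holds all tokens of frame t."""
    num_codebooks, total_steps = delayed.shape
    num_frames = total_steps - num_codebooks + 1
    return np.stack([delayed[k, k:k + num_frames] for k in range(num_codebooks)])


# Toy input: 4 codebooks, 6 frames; token value = 10 * codebook + frame index
codes = np.array([[10 * k + t for t in range(6)] for k in range(4)])
delayed = apply_delay_pattern(codes)
print(delayed)
```

Each column of `delayed` is what the decoder would emit in one step: codebook 0 is one frame ahead of codebook 1, which is one frame ahead of codebook 2, and so on. The cost is only `num_codebooks - 1` extra steps compared with the number of audio frames.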
```python
from transformers import AutoProcessor, MusicgenForConditionalGeneration
import torch
import scipy.io.wavfile

processor = AutoProcessor.from_pretrained("facebook/musicgen-stereo-large")
model = MusicgenForConditionalGeneration.from_pretrained(
    "facebook/musicgen-stereo-large",
    torch_dtype=torch.float16,
).to("cuda")

descriptions = [
    "upbeat jazz piano trio, 120 BPM, walking bass, brushed drums, no vocals",
    "ambient electronic, slow evolving pads, 60 BPM, cinematic, minor key",
]
inputs = processor(text=descriptions, padding=True, return_tensors="pt").to("cuda")

# EnCodec emits 50 token frames per second, so 750 new tokens ≈ 15 seconds
audio_values = model.generate(**inputs, max_new_tokens=750)

# Save as WAV. Cast to float32 first: SciPy cannot write float16 samples.
sampling_rate = model.config.audio_encoder.sampling_rate  # 32000
for i, audio in enumerate(audio_values.float().cpu().numpy()):
    # audio has shape (channels, samples); the WAV writer expects (samples, channels)
    scipy.io.wavfile.write(f"output_{i}.wav", rate=sampling_rate, data=audio.T)
```
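Picking `max_new_tokens` comes down to simple arithmetic on the codec frame rate. Here is a small sketch, assuming the 50 frames-per-second rate implied by the 1,500 tokens per 30 seconds figure above; the helper names are hypothetical.

```python
import math

FRAME_RATE_HZ = 50  # assumed: token frames per second for MusicGen's 32 kHz EnCodec


def tokens_for_duration(seconds: float, frame_rate_hz: int = FRAME_RATE_HZ) -> int:
    """Number of new tokens needed to cover at least `seconds` of audio."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return math.ceil(seconds * frame_rate_hz)


def duration_for_tokens(num_tokens: int, frame_rate_hz: int = FRAME_RATE_HZ) -> float:
    """Approximate audio length in seconds produced by `num_tokens` tokens."""
    return num_tokens / frame_rate_hz


print(tokens_for_duration(15))   # tokens for a 15-second clip
print(duration_for_tokens(256))  # seconds covered by 256 tokens
```

In Transformers, the frame rate should also be readable from `model.config.audio_encoder.frame_rate`, which avoids hard-coding the constant if a different codec is ever swapped in.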
Melody Conditioning
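The melody variant of MusicGen takes a reference recording in addition to the text prompt and generates new audio that follows the reference's melodic contour in the style the prompt describes: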
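```python
from transformers import AutoProcessor, MusicgenMelodyForConditionalGeneration
import torchaudio

processor = AutoProcessor.from_pretrained("facebook/musicgen-melody")
# The melody checkpoint has its own model class in Transformers
model = MusicgenMelodyForConditionalGeneration.from_pretrained(
    "facebook/musicgen-melody"
).to("cuda")

# Load the reference melody, mix it down to mono, and resample it to the
# rate the feature extractor expects (32 kHz for this checkpoint)
melody_waveform, sr = torchaudio.load("melody_reference.wav")
target_sr = processor.feature_extractor.sampling_rate
melody = torchaudio.functional.resample(melody_waveform.mean(dim=0), sr, target_sr)

inputs = processor(
    audio=melody.numpy(),
    sampling_rate=target_sr,
    text=["orchestral arrangement, strings, cinematic"],
    return_tensors="pt",
    padding=True,
).to("cuda")

# 500 tokens at 50 tokens per second ≈ 10 seconds of audio
audio_values = model.generate(**inputs, max_new_tokens=500)
```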
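MusicGen's melody conditioning works from a chromagram of the reference audio, that is, a per-frame summary of which of the 12 pitch classes are active, with octave information discarded. The sketch below illustrates only that idea on a list of note frequencies; it is not the feature extractor the library uses, and the function names are made up for this example.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]


def pitch_class(freq_hz: float, a4_hz: float = 440.0) -> int:
    """Map a frequency to its pitch class (0 = C, ..., 11 = B), ignoring octave."""
    midi_note = round(69 + 12 * math.log2(freq_hz / a4_hz))
    return midi_note % 12


def chroma_vector(freqs_hz) -> list:
    """12-bin vector with a 1 for every pitch class present in `freqs_hz`."""
    bins = [0] * 12
    for freq in freqs_hz:
        bins[pitch_class(freq)] = 1
    return bins


# A C major triad (C4, E4, G4) and the same chord an octave higher
low = chroma_vector([261.63, 329.63, 392.00])
high = chroma_vector([523.25, 659.25, 783.99])
print([NOTE_NAMES[i] for i, on in enumerate(low) if on])
print(low == high)
```

Because the octave is folded away, the same chord played in a different register produces the same vector. This is why the model can keep a reference melody recognizable while changing instrumentation and timbre to match the text prompt.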
Model Sizes and Quality Trade-offs
| Variant | Parameters | VRAM | Quality |
|---------|-----------|------|---------|
| musicgen-small | 300M | ~4 GB | Adequate for prototyping |
| musicgen-medium | 1.5B | ~8 GB | Good for most use cases |
| musicgen-large | 3.3B | ~16 GB | High quality |
| musicgen-stereo-large | 3.3B | ~16 GB | Stereo output, recommended |
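A rough way to sanity-check the VRAM column is to multiply the parameter count by the bytes per parameter. This is a back-of-the-envelope sketch only: it counts decoder weights and ignores activations, the attention cache, the T5 text encoder, and EnCodec, so real usage will be higher.

```python
BYTES_PER_PARAM = {"float32": 4, "float16": 2}

# Parameter counts taken from the table above
PARAM_COUNTS = {
    "musicgen-small": 300e6,
    "musicgen-medium": 1.5e9,
    "musicgen-large": 3.3e9,
}


def weight_memory_gb(num_params: float, dtype: str = "float16") -> float:
    """Memory needed for the weights alone, in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, count in PARAM_COUNTS.items():
    fp32 = weight_memory_gb(count, "float32")
    fp16 = weight_memory_gb(count, "float16")
    print(f"{name}: {fp32:.1f} GB in float32, {fp16:.1f} GB in float16")
```

On this estimate, the large model's weights take about 13 GB in float32, which is in line with the ~16 GB figure once overhead is added, and about half that in float16.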
MusicGen vs Suno/Udio
Suno and Udio generate vocals and lyrics in addition to instrumentals; MusicGen is instrumental only. On pure instrumental quality, MusicGen large is competitive with early Suno but falls short of Suno v3/v4. The trade-off is control: MusicGen is open, self-hostable, and free of usage limits. Note that the pretrained weights are released under CC-BY-NC 4.0, so check the license terms before using generated music commercially.
Real Use Cases
Practical uses include background music for YouTube videos (avoiding copyright claims), soundtrack generation for indie game developers, podcast intros and outros, and rapid prototyping for composers who want to hear a harmonic idea before writing notation.