The SmolLM2 Family
SmolLM2 ships in three sizes: 135M, 360M, and 1.7B parameters. Each size is available as a base model and as an instruction-tuned variant. For tiny LLMs, the key question isn't just parameter count; it's training data quality.
SmolLM2 was trained on:
- FineWeb-Edu: about 1.3T tokens of educational web content, filtered for learning value from the 15T-token FineWeb corpus
- DCLM: Curated CommonCrawl with quality scoring
- The Stack: Permissively licensed code from GitHub
This data mixture explains why SmolLM2-135M outperforms models trained on raw CommonCrawl at 3-4x the parameter count. In the small-model regime, quality filtering matters more than scale.
WebGPU Inference in the Browser
Transformers.js runs SmolLM2 directly in the browser via WebGPU:
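// WebGPU support ships in Transformers.js v3, published as @huggingface/transformers.
import { pipeline } from "@huggingface/transformers";

// Load the instruction-tuned 135M model on the WebGPU backend.
const generator = await pipeline(
  "text-generation",
  "HuggingFaceTB/SmolLM2-135M-Instruct",
  { device: "webgpu" }
);

// ChatML prompt. A template literal is required because the string spans lines.
const prompt = `<|im_start|>user
Explain recursion in one sentence.<|im_end|>
<|im_start|>assistant
`;

// Greedy decoding, capped at 100 new tokens.
const result = await generator(prompt, { max_new_tokens: 100, do_sample: false });
console.log(result[0].generated_text);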
The 135M model downloads as roughly 270MB at fp16, or about 135MB as int8 quantized ONNX. After the first load the weights are cached, and inference runs entirely client-side with no API calls. WebGPU support is available in Chrome 113+ and Edge 113+.
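Since WebGPU is limited to recent browsers, a page will usually want a fallback backend. The following is a minimal sketch, assuming the `wasm` device is an acceptable fallback. `chooseDevice` is a hypothetical helper, not part of Transformers.js.

```javascript
// Hypothetical helper: pick a device string from a navigator-like object.
// It only checks that the WebGPU entry point (navigator.gpu) is present.
function chooseDevice(nav) {
  const hasWebGPU = nav != null && nav.gpu != null;
  return hasWebGPU ? "webgpu" : "wasm";
}

// In a browser: pass chooseDevice(navigator) as the `device` option.
```

The presence of `navigator.gpu` does not guarantee a usable adapter, because `navigator.gpu.requestAdapter()` can still resolve to `null`. A stricter check would await that call before choosing `"webgpu"`.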
The HuggingFaceTB/SmolLM2-135M-Instruct Model
The chat-tuned variant follows the ChatML instruction format and handles:
- Single-turn Q&A
- Text classification (prompt-based)
- Short summarization (under 200 words)
- Code completion for common patterns
It does not reliably handle: multi-step reasoning, complex instruction chains, or responses requiring factual knowledge beyond its training cutoff.
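The ChatML layout used in the hand-written prompt earlier can be generated from a list of messages. This is a sketch under the assumption that the plain `<|im_start|>role ... <|im_end|>` layout is sufficient. `toChatML` is a hypothetical helper, and the model's bundled chat template may add extras such as a default system turn.

```javascript
// Build a ChatML prompt string from { role, content } messages.
// Assumption: mirrors the hand-written prompt layout, nothing more.
function toChatML(messages, addGenerationPrompt = true) {
  const turns = messages
    .map(({ role, content }) => `<|im_start|>${role}\n${content}<|im_end|>\n`)
    .join("");
  // A trailing open assistant turn asks the model to write the reply.
  return addGenerationPrompt ? `${turns}<|im_start|>assistant\n` : turns;
}

const chatPrompt = toChatML([
  { role: "user", content: "Explain recursion in one sentence." },
]);
```

Building the string from structured messages avoids hand-typed newlines and makes it easier to add a system turn later.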
On-Device Use Cases
Autocomplete: At 135M parameters, generation is fast enough for real-time text completion with 50-100ms latency on mid-range mobile hardware.
Classification: Few-shot classification via prompting for 3-5 categories works reliably without fine-tuning.
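Prompt-based classification comes down to two steps: assembling a few-shot prompt, then mapping the model's free-form output back to a known label. A minimal sketch follows. Both helper names and the prompt wording are illustrative assumptions, not a format the model requires.

```javascript
// Hypothetical helper: build a few-shot prompt that ends where the model
// should write a label.
function buildClassificationPrompt(labels, examples, text) {
  const header = `Classify the text as one of: ${labels.join(", ")}.`;
  const shots = examples.map((ex) => `Text: ${ex.text}\nLabel: ${ex.label}`);
  return [header, ...shots, `Text: ${text}\nLabel:`].join("\n\n");
}

// Hypothetical helper: map model output to a known label, or null if the
// first line of the output does not start with any of them.
function parseLabel(output, labels) {
  const firstLine = output.trim().split("\n")[0].toLowerCase();
  return labels.find((label) => firstLine.startsWith(label.toLowerCase())) ?? null;
}
```

Returning `null` for unrecognized output lets the caller retry or fall back to a default instead of trusting whatever the model wrote.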
Summarization: Extractive-style summaries of short documents (under 1000 tokens) with good factual preservation.
Comparison to Phi-3-mini
| Metric | SmolLM2-1.7B | Phi-3-mini (3.8B) |
|---|---|---|
| Parameters | 1.7B | 3.8B |
| Browser deployable | Yes (Transformers.js) | Marginal |
| MMLU (5-shot) | ~45% | ~68% |
| HumanEval (code) | ~25% | ~58% |
| On-device latency | Very fast | Moderate |
SmolLM2-1.7B is not competitive with Phi-3-mini on reasoning benchmarks. The tradeoff is deployment flexibility — SmolLM2 runs in contexts where Phi-3-mini cannot.
Quantization Options
All SmolLM2 variants can be run at int8 or int4 precision, either by quantizing at load time with bitsandbytes or by using GGUF files. The 135M model at int4 is about 68MB, small enough to bundle in mobile apps or serve from a CDN.
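The int4 figure follows from simple arithmetic on parameter count and bits per weight. This is a rough sketch, assuming a nominal 135M parameters and ignoring file metadata, quantization scales, and any tensors kept at higher precision. `estimateWeightsMB` is a hypothetical helper.

```javascript
// Estimate weight storage in MB (10^6 bytes) from parameter count and bit width.
function estimateWeightsMB(paramCount, bitsPerWeight) {
  return (paramCount * bitsPerWeight) / 8 / 1e6;
}

const sizes = {
  fp16: estimateWeightsMB(135e6, 16), // 270
  int8: estimateWeightsMB(135e6, 8),  // 135
  int4: estimateWeightsMB(135e6, 4),  // 67.5
};
```

Real files tend to come out somewhat larger than these estimates, since quantized formats store per-block scales and often keep some layers at higher precision.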