True Openness vs. Open Weights
The word "open" in AI is heavily overloaded. Llama 3 releases model weights with a usage license — the training data, data processing pipeline, and training code remain proprietary. OLMo 2 from Allen AI is different: every component is public. You can download the weights, replicate the training data from Dolma, run the exact training code from the GitHub repository, and verify results with the same evaluation suite.
OLMo 2 Variants
Allen AI released two sizes: OLMo 2 7B (1124 checkpoint) and OLMo 2 13B. Both were trained on Dolma, a 3T-token open dataset built from Common Crawl, Wikipedia, books, GitHub code, and academic papers — all with documented filtering and deduplication pipelines.
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "allenai/OLMo-2-1124-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto",
)
inputs = tokenizer("The key difference between supervised and unsupervised learning is", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
OLMo-Instruct Fine-Tunes
The base OLMo 2 models are pretrained only — not instruction-tuned. Allen AI also releases OLMo-2-Instruct fine-tunes trained on open instruction datasets, which are more practical for conversational applications while maintaining the full reproducibility guarantee.
Benchmark Performance
OLMo 2 7B outperforms Llama 3.1 8B on several reasoning and knowledge benchmarks (ARC-Challenge, HellaSwag, MMLU) while being comparable on coding. This is notable because Llama 3.1 had access to substantially more compute and a larger, curated (but private) training dataset.
The Dolma Dataset
Dolma is worth examining independently of OLMo. The 3T-token corpus is one of the largest fully documented and reproducible pretraining datasets available:
- Common Crawl (cleaned with CCNet pipeline)
- Wikipedia and Wikibooks (all languages, deduplicated)
- Project Gutenberg (books with expired copyright)
- OpenWebMath (mathematical text from the web)
- RedPajama-v1 GitHub (code across 30+ languages)
- Semantic Scholar (scientific papers)
Why Reproducibility Matters for Research
The scientific value of OLMo 2 is that researchers can run ablation studies on training data composition, compare different tokenization strategies, or test curriculum learning schedules without guessing what Meta or Google did. For AI safety research, interpretability work, and data contamination studies, having the full training pipeline available is essential.