Synthetic Data at Scale
OpenHermes 2.5 is an instruction-tuned model released by Teknium of Nous Research, built on Mistral 7B and trained on approximately one million synthetic conversations generated primarily by GPT-4. It is one of the clearer demonstrations that data curation strategy can matter more than raw dataset size.
What Makes the Data Different
Many earlier open instruction datasets contain tens of thousands of examples. OpenHermes 2.5 used around 1,000,000, but the volume alone is not the story. Nous Research curated the data aggressively: deduplicating it, filtering for quality, balancing topics across coding, reasoning, roleplay, analysis, and instruction-following, and removing examples where GPT-4's output was clearly low-effort or off-topic.
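The curation steps described above can be illustrated with a minimal sketch. This is not the actual OpenHermes pipeline, which is not documented here; the field names (`instruction`, `response`, `topic`), the refusal markers, and the five-word minimum are all assumptions chosen for illustration.

```python
import hashlib
import re
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivially different copies match."""
    return re.sub(r"\s+", " ", text.strip().lower())


def conversation_key(example: dict) -> str:
    """Hash the normalized instruction and response together."""
    joined = normalize(example["instruction"]) + "\n" + normalize(example["response"])
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Hypothetical markers of low-effort output; a real pipeline would tune these.
REFUSAL_MARKERS = ("as an ai language model", "i cannot help with")


def is_low_effort(response: str, min_words: int = 5) -> bool:
    """Flag very short responses and ones containing a refusal marker."""
    text = normalize(response)
    if len(text.split()) < min_words:
        return True
    return any(marker in text for marker in REFUSAL_MARKERS)


def curate(examples: list) -> list:
    """Drop exact duplicates (after normalization), then low-effort responses."""
    seen = set()
    kept = []
    for example in examples:
        key = conversation_key(example)
        if key in seen:
            continue
        seen.add(key)
        if is_low_effort(example["response"]):
            continue
        kept.append(example)
    return kept


def topic_counts(examples: list) -> Counter:
    """Count examples per topic to check that no single topic dominates."""
    return Counter(example.get("topic", "unknown") for example in examples)


raw = [
    {"topic": "coding", "instruction": "Reverse a string in Python.",
     "response": "Use slicing: s[::-1] returns the string reversed."},
    {"topic": "coding", "instruction": "reverse a string in  Python.",
     "response": "Use slicing: s[::-1] returns the string reversed."},
    {"topic": "reasoning", "instruction": "Is 17 prime?", "response": "Yes."},
    {"topic": "analysis", "instruction": "Summarize the report.",
     "response": "As an AI language model, I cannot help with that request today."},
    {"topic": "reasoning", "instruction": "Is 21 prime?",
     "response": "No, because 21 equals 3 times 7, so it has divisors other than 1 and itself."},
]

curated = curate(raw)
print(len(curated), dict(topic_counts(curated)))
```

Exact-match hashing only catches verbatim or near-verbatim copies. Catching paraphrased duplicates would need a fuzzier technique such as MinHash or embedding similarity, which this sketch does not attempt.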
The result is a model that handles highly varied system prompts without the degradation common in smaller instruction models trained on narrow distributions.
ChatML Format
OpenHermes 2.5 uses the ChatML prompt format, which provides clean structure for multi-turn dialogue:
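from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "teknium/OpenHermes-2.5-Mistral-7B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# device_map="auto" requires the accelerate package to be installed
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")
# ChatML wraps each turn as <|im_start|>role ... <|im_end|>; the prompt ends
# with an open assistant turn for the model to complete
prompt = """<|im_start|>system
You are a helpful coding assistant. Be concise.<|im_end|>
<|im_start|>user
Write a Python function to flatten a nested list.<|im_end|>
<|im_start|>assistant
"""
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
# do_sample=True is required for temperature to take effect
output = model.generate(**inputs, max_new_tokens=300, do_sample=True, temperature=0.7)
# Decode only the newly generated tokens, not the echoed prompt
new_tokens = output[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))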
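For multi-turn dialogue, writing the prompt string by hand gets error-prone. The sketch below assembles the same ChatML layout from a list of messages. The helper name `build_chatml` and its `add_generation_prompt` parameter are illustrative, not part of any library.

```python
ALLOWED_ROLES = {"system", "user", "assistant"}


def build_chatml(messages: list, add_generation_prompt: bool = True) -> str:
    """Render a list of {"role", "content"} dicts as a ChatML prompt string."""
    parts = []
    for message in messages:
        role = message["role"]
        if role not in ALLOWED_ROLES:
            raise ValueError(f"unexpected role: {role!r}")
        parts.append(f"<|im_start|>{role}\n{message['content']}<|im_end|>\n")
    if add_generation_prompt:
        # Leave an open assistant turn for the model to complete.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


messages = [
    {"role": "system", "content": "You are a helpful coding assistant. Be concise."},
    {"role": "user", "content": "Write a Python function to flatten a nested list."},
]

prompt = build_chatml(messages)
print(prompt)
```

To continue a conversation, append the model's reply as an `assistant` message, add the next `user` message, and rebuild the prompt. If the tokenizer ships a chat template, `tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)` in transformers should produce an equivalent string without a custom helper.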
Running With Ollama
For local deployment, Ollama provides a quantized version:
ollama pull openhermes
ollama run openhermes "Explain the CAP theorem in simple terms."
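Once the model is pulled, it can also be called programmatically through Ollama's local REST API. The following is a minimal sketch using only the standard library; it assumes the Ollama server is running on its default address (`localhost:11434`) and that the model tag is `openhermes` as in the commands above. The function names are illustrative.

```python
import json
import urllib.request

# Default address of a locally running Ollama server (an assumption;
# adjust if the server is configured differently).
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_payload(prompt: str, model: str = "openhermes", system: str = None) -> dict:
    """Build the JSON body for a non-streaming /api/generate request."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    if system is not None:
        payload["system"] = system  # overrides the model's default system prompt
    return payload


def generate(prompt: str, model: str = "openhermes", system: str = None,
             timeout: float = 120.0) -> str:
    """Send the prompt to the local Ollama server and return the reply text."""
    body = json.dumps(build_payload(prompt, model, system)).encode("utf-8")
    request = urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())["response"]
```

With the server running, `generate("Explain the CAP theorem in simple terms.", system="Be concise.")` returns the completion as a string. Ollama applies the model's prompt template itself, so the ChatML markers do not need to be added by hand here.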
GPQA and Benchmark Position
GPQA (Graduate-Level Google-Proof Q&A) is a benchmark of expert-written science questions designed to be hard to answer even with unrestricted web search, so it rewards actual reasoning rather than pattern matching. OpenHermes 2.5 scores competitively on it for its size. The model also ranks in the top tier of open 7B instruction models on coding benchmarks such as HumanEval and reasoning benchmarks such as ARC-Challenge.
System Prompt Flexibility
One of the model's practical strengths is how well it responds to varied system prompts. Unlike models trained on narrow chat formats, OpenHermes 2.5 reliably adopts personas, follows domain-specific constraints, and maintains instruction-following across long multi-turn sessions. This makes it particularly useful for roleplay applications, domain-specific assistants, and structured output generation.
Data Volume vs. Data Quality
The lesson from OpenHermes 2.5 is nuanced: 1M examples worked here because they were diverse and filtered, not simply because of the count. Teams attempting to replicate this approach should budget more time for data curation than for training, since the training run for a 7B model is relatively cheap on modern hardware.