The Multilingual Gap in Instruction Tuning
Large language models are predominantly trained on English data and fine-tuned on English instructions. For high-resource languages like French or German, this matters less, because enough internet text exists to provide a reasonable representation. For Swahili, Yoruba, Ukrainian, or Welsh, the gap is severe. Aya 23 from Cohere For AI addresses this imbalance for the 23 languages it covers.
23 Languages, Including Low-Resource Ones
Aya 23 covers 23 languages: Arabic, Chinese (Simplified/Traditional), Czech, Dutch, English, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Korean, Persian, Polish, Portuguese, Romanian, Russian, Spanish, Turkish, Ukrainian, and Vietnamese.
The inclusion of Ukrainian is notable given how underrepresented it is in standard instruction datasets. The broader Aya project also conducted targeted data collection for several African languages that typically appear only in machine-translated form in other multilingual datasets, although those languages are not among the 23 listed above.
The Aya Dataset
The foundation of Aya 23's fine-tuning is the Aya Dataset: roughly 204,000 human-written and human-verified prompt-completion pairs across 65 languages (Aya 23 uses a 23-language subset). Unlike synthetic multilingual datasets generated by translating English examples, these were created by native speakers working in their own languages, so they capture idiomatic expressions, cultural context, and language-specific reasoning patterns.
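To make the idea of a 23-language subset concrete, here is a minimal sketch of filtering prompt-completion records by language. The field names (`inputs`, `targets`, `language`), the sample rows, and the helper `filter_to_aya23` are assumptions for illustration only; they are not confirmed details of the published dataset or of how the Aya 23 team built its training mix.

```python
from collections import Counter

# The 23 languages named in this article. The labels are an assumption:
# a real dataset may spell them differently (for example, separate
# entries for Simplified and Traditional Chinese).
AYA23_LANGUAGES = {
    "Arabic", "Chinese", "Czech", "Dutch", "English", "French", "German",
    "Greek", "Hebrew", "Hindi", "Indonesian", "Italian", "Japanese",
    "Korean", "Persian", "Polish", "Portuguese", "Romanian", "Russian",
    "Spanish", "Turkish", "Ukrainian", "Vietnamese",
}


def filter_to_aya23(records, language_key="language"):
    """Keep only records whose language label is in AYA23_LANGUAGES."""
    return [r for r in records if r.get(language_key) in AYA23_LANGUAGES]


# Hypothetical records, not real dataset rows.
sample_records = [
    {"inputs": "Що таке фотосинтез?", "targets": "...", "language": "Ukrainian"},
    {"inputs": "Kí ni olú ìlú Nàìjíríà?", "targets": "...", "language": "Yoruba"},
    {"inputs": "Qu'est-ce qu'un trou noir ?", "targets": "...", "language": "French"},
]

subset = filter_to_aya23(sample_records)
counts = Counter(r["language"] for r in subset)
print(counts)
```

In this toy sample the Yoruba record is dropped because Yoruba is not one of the 23 languages, while the Ukrainian and French records are kept.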
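# Load Aya 23 8B and generate a reply to a Ukrainian prompt
from transformers import AutoTokenizer, AutoModelForCausalLM

model_id = "CohereForAI/aya-23-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
# torch_dtype="auto" keeps the checkpoint's dtype; device_map="auto" requires the accelerate package
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto", device_map="auto")

# Ukrainian example: "Explain quantum computing in simple terms."
messages = [{"role": "user", "content": "Поясніть квантові обчислення простими словами."}]
# Wrap the message in the model's chat template and move the token IDs to the model's device
input_ids = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt").to(model.device)
output = model.generate(input_ids, max_new_tokens=300)
# The decoded text contains the prompt followed by the generated reply
print(tokenizer.decode(output[0], skip_special_tokens=True))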
Model Sizes and License
Aya 23 comes in 8B and 35B variants, both under the CC-BY-NC 4.0 license (non-commercial). The 35B model substantially outperforms the 8B on complex reasoning tasks across all 23 languages, while the 8B is practical for deployment on a single A100.
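A rough back-of-the-envelope check on the single-A100 point: the memory needed just to hold the weights is approximately the parameter count times the bytes per parameter. The sketch below assumes nominal counts of 8 and 35 billion parameters and ignores the KV cache, activations, and framework overhead, so real usage will be higher. The helper name `weight_memory_gb` is hypothetical.

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "int4": 0.5}

# Nominal parameter counts (assumption: the exact counts differ slightly).
NOMINAL_PARAMS = {"aya-23-8B": 8e9, "aya-23-35B": 35e9}


def weight_memory_gb(num_params, dtype):
    """Approximate weight memory in GB (10^9 bytes), weights only."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


for name, n in NOMINAL_PARAMS.items():
    for dtype in ("bfloat16", "int8", "int4"):
        print(f"{name} {dtype}: ~{weight_memory_gb(n, dtype):.1f} GB")
```

Under these assumptions the 8B weights take about 16 GB in bfloat16, which fits on either the 40 GB or 80 GB A100, while the 35B weights take about 70 GB in bfloat16 and leave little headroom on an 80 GB card unless the model is quantized or sharded across devices.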
Benchmark Comparisons
Against multilingual competitors on WMT translation and multilingual reasoning:
- Aya 23 8B surpasses mT0-13B and BLOOMZ-7B on multilingual instruction-following
- The 35B variant outperforms Aya 101 (the predecessor) by an average of 6.6% on discriminative tasks and 4.1% on generative tasks
- Performance on low-resource languages shows the largest gains over baselines
Practical Applications
Teams building multilingual customer support, document summarization for international markets, or government services that must serve non-English speakers will likely find Aya 23 more capable than an English-only model wrapped in a translation layer. Because the model follows instructions natively in each language, it avoids the compounding errors of translate-then-reason pipelines, where a mistake in any stage carries through to the final answer.
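The compounding effect can be illustrated with simple arithmetic. If each stage of a pipeline (translate in, reason in English, translate back) succeeds independently with some probability, the end-to-end success rate is the product of the stage rates. The numbers below are hypothetical and chosen only to show the shape of the effect; they are not measurements of any model, and the independence assumption is a simplification.

```python
from math import prod


def pipeline_success(*stage_accuracies):
    """End-to-end success rate, assuming stages fail independently."""
    return prod(stage_accuracies)


# Hypothetical stage accuracies for illustration only.
translate_in = 0.95
reason_in_english = 0.90
translate_out = 0.95

three_stage = pipeline_success(translate_in, reason_in_english, translate_out)
print(f"translate-then-reason pipeline: {three_stage:.3f}")
```

With these made-up figures the three-stage pipeline lands at roughly 0.81, lower than any single stage, which is why removing the two translation hops can matter even when each hop looks accurate on its own.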