Understanding and Mitigating Language Confusion in LLMs
Kelly Marchisio, Wei-Yin Ko, Alexandre Bérard, Théo Dehaze, Sebastian Ruder

TL;DR
This paper introduces the Language Confusion Benchmark to evaluate whether LLMs generate text in the user's desired language. It finds widespread language confusion, especially with complex prompts, and explores mitigation strategies.
Contribution
The paper presents the first comprehensive benchmark for language confusion in LLMs and analyzes factors affecting language accuracy, along with mitigation techniques.
Findings
LLMs often fail to generate text in the correct language.
Language confusion increases with prompt complexity and sampling temperature.
Few-shot prompting and multilingual fine-tuning reduce language confusion.
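To make the first finding concrete, here is a minimal sketch of how one might flag response lines that are not in the expected writing system. It is not the benchmark's metric or the authors' code: the helper names `dominant_script` and `line_pass_rate` are hypothetical, and the check relies only on Unicode character names from Python's standard library. As a script-level proxy it cannot tell apart languages that share a script (for example English and French), so a real evaluation would need a proper language identifier.

```python
import unicodedata
from collections import Counter
from typing import Optional, Set


def dominant_script(line: str) -> Optional[str]:
    """Return the most common script label among the letters of `line`.

    The label is the first word of the Unicode character name
    (e.g. "LATIN", "CYRILLIC", "CJK"). Returns None if the line has no letters.
    """
    counts: Counter = Counter()
    for ch in line:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            if name:
                counts[name.split()[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


def line_pass_rate(response: str, expected_scripts: Set[str]) -> float:
    """Fraction of lines (those containing letters) whose dominant script is expected.

    Hypothetical proxy metric for illustration only; returns 0.0 when the
    response has no lines with letters.
    """
    scripts = [dominant_script(ln) for ln in response.splitlines()]
    scripts = [s for s in scripts if s is not None]
    if not scripts:
        return 0.0
    return sum(s in expected_scripts for s in scripts) / len(scripts)


if __name__ == "__main__":
    # A Russian prompt would expect Cyrillic output; the middle line is "confused".
    reply = "Привет, мир\nThis line is English\nЕщё одна строка"
    print(line_pass_rate(reply, {"CYRILLIC"}))
```

A per-line score is used here because a response can start in the right language and switch partway through, which a single whole-response label would hide.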
Abstract
We investigate a surprising limitation of LLMs: their inability to consistently generate text in a user's desired language. We create the Language Confusion Benchmark (LCB) to evaluate such failures, covering 15 typologically diverse languages with existing and newly-created English and multilingual prompts. We evaluate a range of LLMs on monolingual and cross-lingual generation reflecting practical use cases, finding that Llama Instruct and Mistral models exhibit high degrees of language confusion and even the strongest models fail to consistently respond in the correct language. We observe that base and English-centric instruct models are more prone to language confusion, which is aggravated by complex prompts and high sampling temperatures. We find that language confusion can be partially mitigated via few-shot prompting, multilingual SFT and preference tuning. We release our…
Code & Models
- 🤗 shisa-ai/shisa-v2.1-unphi4-14b · model · 387 dl · ♡ 4
- 🤗 shisa-ai/shisa-v2.1-lfm2-1.2b · model · 78 dl · ♡ 2
- 🤗 shisa-ai/shisa-v2.1-llama3.2-3b · model · 14 dl · ♡ 2
- 🤗 shisa-ai/shisa-v2.1-qwen3-8b · model · 1.4k dl · ♡ 7
- 🤗 shisa-ai/shisa-v2.1-llama3.3-70b · model · 18 dl · ♡ 8
- 🤗 XpressAI/shisa-v2.1-lfm2-1.2b-GGUF · model · 174 dl
- 🤗 XpressAI/shisa-v2.1-llama3.2-3b-GGUF · model · 12 dl
- 🤗 XpressAI/shisa-v2.1-qwen3-8b-GGUF · model · 37 dl
- 🤗 XpressAI/shisa-v2.1-unphi4-14b-GGUF · model · 141 dl
Taxonomy
Topics: Interpreting and Communication in Healthcare · Translation Studies and Practices
Methods: Balanced Selection · Shrink and Fine-Tune · LLaMA
