TLPO: Token-Level Policy Optimization for Mitigating Language Confusion in Large Language Models
Jinho Choo, JunSeung Lee, Jimyeong Kim, Yeeho Song, S. K. Hong, Yeong-Dae Kwon

TL;DR
TLPO is a fine-tuning method that reduces language confusion in multilingual LLMs by applying token-level updates, improving language consistency without harming overall performance.
Contribution
The paper introduces TLPO, a novel token-level policy optimization framework that effectively mitigates language confusion in multilingual models with minimal impact on general capabilities.
Findings
TLPO outperforms existing methods in language consistency across multiple languages.
TLPO maintains downstream task accuracy while reducing language confusion.
Token-level updates enable more precise mitigation than sequence-level approaches.
Abstract
Large language models (LLMs) demonstrate strong multilingual capabilities, yet often fail to consistently generate responses in the intended language, exhibiting a phenomenon known as language confusion. Prior mitigation approaches based on sequence-level fine-tuning, such as DPO, ORPO, and GRPO, operate at the level of entire responses and can lead to unintended degradation of general model capabilities, motivating the need for more fine-grained alternatives. To address this, we introduce Token-Level Policy Optimization (TLPO), a fine-tuning framework designed to mitigate language confusion through localized, token-level updates. TLPO identifies error-prone positions, explores alternative candidate tokens, and updates the policy using a tailored objective to suppress error-inducing outputs at a granular level. This selective intervention enables effective mitigation of language…
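The three steps named in the abstract (identify error-prone positions, explore alternative candidate tokens, apply a token-level objective) can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the ASCII-based language check, the helper names, the top-k exploration, and the loss are hypothetical stand-ins and are not taken from the paper, whose actual objective is not given in this excerpt.

```python
import math

def is_target_language(token: str) -> bool:
    """Hypothetical language check: pure-ASCII tokens count as the target language."""
    return token.isascii()

def find_error_positions(tokens):
    """Step 1 (assumed form): indices of tokens that are not in the target language."""
    return [i for i, tok in enumerate(tokens) if not is_target_language(tok)]

def explore_candidates(probs, k=3):
    """Step 2 (assumed form): the k most probable candidate tokens at one position."""
    return sorted(probs, key=probs.get, reverse=True)[:k]

def token_level_loss(probs, candidates):
    """Step 3 (toy objective, not the paper's): negative log of the share of
    candidate probability mass that falls on target-language tokens."""
    total = sum(probs[c] for c in candidates)
    good = sum(probs[c] for c in candidates if is_target_language(c))
    return -math.log(max(good, 1e-12) / total)

# A response that drifts out of English at position 2.
response = ["The", "answer", "是", "42"]
# Hypothetical next-token distribution of the policy at that position.
step_probs = {"是": 0.5, "is": 0.4, "was": 0.1}

positions = find_error_positions(response)
candidates = explore_candidates(step_probs, k=3)
loss = token_level_loss(step_probs, candidates)
```

In this sketch the loss is computed only at the flagged position, so minimizing it would shift probability toward target-language candidates there while leaving every other position untouched. That locality is the property the abstract contrasts with sequence-level methods such as DPO, ORPO, and GRPO.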
