Language as a Latent Variable for Reasoning Optimization
Linjuan Wu, Haoran Wei, Jialong Tang, Shuang Luo, Baosong Yang, Yongliang Shen, Weiming Lu

TL;DR
This paper explores how language functions as a latent variable influencing reasoning in multilingual models, introducing a novel RL framework that enhances reasoning accuracy across languages without relying on chain-of-thought annotations.
Contribution
It proposes polyGRPO, a reinforcement learning method that leverages language variation as an exploration signal to improve multilingual reasoning performance.
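The summary describes polyGRPO only at a high level, and the abstract is truncated. What follows is therefore a minimal sketch under stated assumptions, not the authors' implementation. It computes the standard GRPO group-relative advantage, (r_i - mean(r)) / std(r), over a group of sampled responses. Here the group is assumed to come from sampling the same prompt under several language conditions, so that language choice adds diversity to the group. The helper names, the `sample_fn`/`reward_fn` callables, and the language labels (including `"free"` for an unconstrained condition) are all hypothetical.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: normalize each reward against its own group.

    Uses the population standard deviation; some implementations use the
    sample standard deviation instead. eps avoids division by zero when
    every reward in the group is identical.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def polyglot_group(prompt, languages, sample_fn, reward_fn):
    """Hypothetical: build one group by sampling the same prompt under
    several language conditions, then score it with group-relative advantages.

    sample_fn(prompt, lang) -> response and reward_fn(response) -> float
    are assumed stand-ins for a policy sampler and a verifiable reward.
    """
    responses = [(lang, sample_fn(prompt, lang)) for lang in languages]
    rewards = [reward_fn(resp) for _, resp in responses]
    advantages = group_relative_advantages(rewards)
    # Each entry: ((language, response), reward, advantage)
    return list(zip(responses, rewards, advantages))


if __name__ == "__main__":
    # Toy stand-ins (hypothetical): a fixed "sampler" and an exact-match reward.
    answers = {"en": "41", "zh": "42", "fr": "42", "free": "42"}
    sample = lambda prompt, lang: answers[lang]
    reward = lambda resp: 1.0 if resp == "42" else 0.0
    for (lang, resp), r, a in polyglot_group("6*7=?", list(answers), sample, reward):
        print(lang, resp, r, round(a, 3))
```

In this sketch, responses that beat their group's average get a positive advantage and the rest get a negative one. A language condition that happens to reach the correct answer is therefore reinforced relative to the others.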
Findings
Non-English responses often outperform English on reasoning tasks.
Unconstrained language conditions yield the best reasoning accuracy.
polyGRPO improves accuracy on multilingual math problems by over 6%.
Abstract
As LLMs reduce English-centric bias, a surprising trend emerges: non-English responses sometimes outperform English on reasoning tasks. We hypothesize that language functions as a latent variable that structurally modulates the model's internal inference pathways, rather than merely serving as an output medium. To test this, we conducted a Polyglot Thinking Experiment, in which models were prompted to solve identical problems under language-constrained and language-unconstrained conditions. Results show that non-English responses often achieve higher accuracy, and the best performance frequently occurs when language is unconstrained, suggesting that multilinguality broadens the model's latent reasoning space. Based on this insight, we propose polyGRPO (Polyglot Group Relative Policy Optimization), an RL framework that treats language variation as an implicit exploration signal. It…
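The Polyglot Thinking Experiment compares accuracy on identical problems across language conditions. Below is a minimal sketch of that bookkeeping. It assumes each evaluated response is logged as a `(condition, is_correct)` pair. The record format, the function names, and the condition labels are hypothetical, and the records in the usage example are toy data, not results from the paper.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """records: iterable of (condition, is_correct) pairs.

    Returns {condition: fraction of records marked correct}.
    """
    totals = defaultdict(lambda: [0, 0])  # condition -> [correct, attempted]
    for condition, correct in records:
        totals[condition][0] += int(correct)
        totals[condition][1] += 1
    return {c: hits / n for c, (hits, n) in totals.items()}


def best_condition(records):
    """Return the highest-accuracy condition (first one wins on ties) and the full table."""
    acc = accuracy_by_condition(records)
    return max(acc, key=acc.get), acc


if __name__ == "__main__":
    # Toy records only; these are NOT results from the paper.
    toy = [
        ("en", True), ("en", False),
        ("de", True), ("de", False),
        ("unconstrained", True), ("unconstrained", True),
    ]
    best, table = best_condition(toy)
    print(best, table)
```

Keeping the per-condition table alongside the argmax makes it easy to check how often a non-English or unconstrained condition comes out on top across many problems, instead of relying on a single comparison.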
