Multilingual Test-Time Scaling via Initial Thought Transfer
Prasoon Bajpai, Tanmoy Chakraborty

TL;DR
This paper systematically studies test-time scaling in multilingual models, revealing language-dependent effectiveness and proposing MITT, a prefix-tuning method that improves reasoning performance across diverse languages.
Contribution
The paper is the first to systematically analyze test-time scaling in multilingual settings, and it introduces MITT, an unsupervised prefix-tuning approach for enhancing reasoning across languages.
Findings
Relative gains from test-time scaling vary significantly across languages.
Models often switch to English mid-reasoning, even under strictly monolingual prompts.
MITT improves reasoning performance, especially for low-resource languages.
Abstract
Test-time scaling has emerged as a widely adopted inference-time strategy for boosting reasoning performance. However, its effectiveness has been studied almost exclusively in English, leaving its behavior in other languages largely unexplored. We present the first systematic study of test-time scaling in multilingual settings, evaluating DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Qwen-7B across both high- and low-resource Latin-script languages. Our findings reveal that the relative gains from test-time scaling vary significantly across languages. Additionally, models frequently switch to English mid-reasoning, even when operating under strictly monolingual prompts. We further show that low-resource languages not only produce initial reasoning thoughts that differ significantly from English but also have lower internal consistency across generations in their early reasoning.…
Taxonomy
Topics: Tensor decomposition and applications · Grit, Self-Efficacy, and Motivation · Educational and Psychological Assessments
