Multilingual Test-Time Scaling via Initial Thought Transfer

Prasoon Bajpai; Tanmoy Chakraborty

arXiv:2505.15508·cs.CL·May 22, 2025

Open Access

TL;DR

This paper systematically studies test-time scaling in multilingual settings, finds that its effectiveness depends on the language, and proposes MITT, an unsupervised prefix-tuning method that improves reasoning performance across diverse languages.

Contribution

The paper presents the first systematic analysis of test-time scaling in multilingual settings and introduces MITT, an unsupervised prefix-tuning approach that enhances reasoning across languages.

Findings

01. Test-time scaling gains vary across languages.

02. Models often switch to English mid-reasoning.

03. MITT improves reasoning performance, especially for low-resource languages.

Abstract

Test-time scaling has emerged as a widely adopted inference-time strategy for boosting reasoning performance. However, its effectiveness has been studied almost exclusively in English, leaving its behavior in other languages largely unexplored. We present the first systematic study of test-time scaling in multilingual settings, evaluating DeepSeek-R1-Distill-Llama-8B and DeepSeek-R1-Distill-Qwen-7B across both high- and low-resource Latin-script languages. Our findings reveal that the relative gains from test-time scaling vary significantly across languages. Additionally, models frequently switch to English mid-reasoning, even when operating under strictly monolingual prompts. We further show that low-resource languages not only produce initial reasoning thoughts that differ significantly from English but also have lower internal consistency across generations in their early reasoning.…
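The idea of "internal consistency across generations" in early reasoning can be made concrete with a small sketch. This is not the paper's metric: it assumes, purely for illustration, that an initial thought is the first `k` whitespace tokens of a generation and that consistency is the mean pairwise Jaccard overlap of those token sets. The function names and the default `k` are hypothetical.

```python
from itertools import combinations


def initial_thought(text: str, k: int = 20) -> set[str]:
    """Crude stand-in for an 'initial thought': the first k lowercase tokens."""
    return set(text.lower().split()[:k])


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard overlap of two token sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def internal_consistency(generations: list[str], k: int = 20) -> float:
    """Mean pairwise Jaccard overlap of the opening tokens of each generation.

    Assumption: lexical overlap approximates similarity of early reasoning.
    An embedding-based similarity could be substituted for jaccard().
    """
    prefixes = [initial_thought(g, k) for g in generations]
    pairs = list(combinations(prefixes, 2))
    if not pairs:
        raise ValueError("need at least two generations")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```

Under these assumptions, a language whose sampled generations open in very different ways would score closer to 0, and one whose generations open almost identically would score closer to 1.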

Taxonomy

Topics: Tensor decomposition and applications · Grit, Self-Efficacy, and Motivation · Educational and Psychological Assessments