Crosslingual Reasoning through Test-Time Scaling

Zheng-Xin Yong; M. Farid Adilazuarda; Jonibek Mansurov; Ruochen Zhang; Niklas Muennighoff; Carsten Eickhoff; Genta Indra Winata; Julia Kreutzer; Stephen H. Bach; Alham Fikri Aji

arXiv:2505.05408·cs.CL·May 9, 2025

TL;DR

This paper explores how scaling inference in English-centric multilingual language models enhances crosslingual reasoning, revealing mechanisms, limitations, and strategies for better multilingual and out-of-domain reasoning performance.

Contribution

It demonstrates that test-time scaling improves multilingual reasoning, shows how the language of chain-of-thought reasoning can be controlled, and highlights limitations in out-of-domain generalization.

Findings

01

Scaling inference improves multilingual reasoning, especially in low-resource languages.

02

Models follow a quote-and-think pattern for non-English inputs.

03

Controlling the reasoning language matters: models reason more accurately and more efficiently in high-resource languages.

Abstract

Reasoning capabilities of large language models are primarily studied for English, even when pretrained models are multilingual. In this work, we investigate to what extent English reasoning finetuning with long chain-of-thoughts (CoTs) can generalize across languages. First, we find that scaling up inference compute for English-centric reasoning language models (RLMs) improves multilingual mathematical reasoning across many languages including low-resource languages, to an extent where they outperform models twice their size. Second, we reveal that while English-centric RLMs' CoTs are naturally predominantly English, they consistently follow a quote-and-think pattern to reason about quoted non-English inputs. Third, we discover an effective strategy to control the language of long CoT reasoning, and we observe that models reason better and more efficiently in high-resource languages.…
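The quote-and-think pattern described in the abstract can be illustrated with a small script-based check. The following is a minimal sketch, not the authors' analysis code: the function names `non_latin_share` and `non_english_quotes`, the quote characters handled, and the 0.5 threshold are all assumptions made for illustration. It pulls quoted spans out of a chain-of-thought and keeps those written mostly in a non-Latin script.

```python
import re
import unicodedata


def non_latin_share(text: str) -> float:
    """Fraction of alphabetic characters that are not Latin-script letters."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    non_latin = sum(
        1 for ch in letters
        if not unicodedata.name(ch, "").startswith("LATIN")
    )
    return non_latin / len(letters)


def non_english_quotes(cot: str, threshold: float = 0.5) -> list[str]:
    """Return quoted spans whose letters are mostly non-Latin.

    Hypothetical helper: handles straight and curly double quotes only,
    and the threshold is an illustrative choice, not a value from the paper.
    """
    spans = re.findall(r'["\u201c]([^"\u201d]+)["\u201d]', cot)
    return [s for s in spans if non_latin_share(s) > threshold]


if __name__ == "__main__":
    cot = (
        'The question says "ジョンはリンゴを5個持っています", '
        "which means John has 5 apples."
    )
    print(non_english_quotes(cot))
    print(round(non_latin_share(cot), 2))
```

A check based on Unicode script cannot separate English from other Latin-script languages such as French or Swahili, so it only applies to inputs written in non-Latin scripts.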

Peer Reviews

Decision·Submitted to ICLR 2026

Reviewer 01 · Rating 4 · Confidence 3

Strengths

1. The research questions are clearly stated and focused, with the experiments neatly mapped to answering them, and the paper is well-written.
2. It is useful to tie the core behavior to how the model was trained, specifically in the s1 training data, as per appendix C.4 (this should be brought to the main text, in my view).
3. A complete evaluation of the query language and reasoning language in Table 2 is illustrative of the strong performing high-resource languages and the low-re

Weaknesses

1. While it is evident (and plausible) that the “quote-and-think” behavior is present in generations, it is not necessarily causal, as is implied. Is there any evidence to suggest that removing or masking the non-English spans degrades accuracy?
2. Excluding Latin-script languages to avoid misclassification limits generality, so it’s a bit unclear how robust the “quote-and-think” behavior is across the full set of languages.
3. All experiments use s1 / the Qwen family of models; it would be val

Reviewer 02 · Rating 6 · Confidence 3

Strengths

1. Extensive experiments have been conducted to investigate the extent to which English reasoning finetuning can generalize across languages.
2. The 'quote-and-think' approach is particularly insightful. It demonstrates that the model does not merely translate non-English input into English before reasoning, but actively parses and reasons over the original linguistic structure.

Weaknesses

I want to know whether the improvement is not only from English to low-resource languages, but whether using any high-resource language similarly improves reasoning in relatively low-resource languages.

Reviewer 03 · Rating 6 · Confidence 4

Strengths

The paper is written clearly, with tightly scoped research questions. It shows that scaling test-time "thinking" tokens reliably boosts accuracy for models with more than 3B parameters. The tested model size is large enough, like the s1-14B model with 8k thinking tokens. The analysis uncovers a mechanistic "quote-and-think" phenomenon in which the model mainly reasons in English but quotes non-English fragments from the prompt. The paper also provides actionable guidance on language control via

Weaknesses

- (i) The findings are primarily on s1 (basis: Qwen). It would be better to verify if identical trends hold for other multilingual bases (e.g., Llama, DeepSeek-Distilled-R1) under the same setup.
- (ii) Section 6 on "Language Forcing" is also an important section. However, the analysis is only conducted on one dataset, that is, MGSM, limiting the generality of the findings. It is expected to see a similar study on other multilingual reasoning datasets.

Code & Models

Repositories

BatsResearch/crosslingual-test-time-scaling (Official)

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications