Language Matters: How Do Multilingual Input and Reasoning Paths Affect Large Reasoning Models?

Zhi Rui Tam; Cheng-Kuang Wu; Yu Ying Chiu; Chieh-Yen Lin; Yun-Nung Chen; Hung-yi Lee

arXiv:2505.17407·cs.CL·May 26, 2025

4 Reviews

TL;DR

This paper investigates how multilingual large reasoning models choose their reasoning language, revealing biases towards high-resource languages and varying effects on different task types, which impacts model fairness and performance.

Contribution

It uncovers the tendency of LRMs to default to high-resource languages for reasoning and analyzes how language choice affects performance across diverse tasks.

Findings

01

LRMs prefer reasoning in high-resource languages like English.

02

Performance drops when reasoning in the input language, especially for low-resource languages.

03

Task type influences the impact of language choice on model performance.

Abstract

Large reasoning models (LRMs) have demonstrated impressive performance across a range of reasoning tasks, yet little is known about their internal reasoning processes in multilingual settings. We begin with a critical question: "In which language do these models reason when solving problems presented in different languages?" Our findings reveal that, despite multilingual training, LRMs tend to default to reasoning in high-resource languages (e.g., English) at test time, regardless of the input language. When constrained to reason in the same language as the input, model performance declines, especially for low-resource languages. In contrast, reasoning in high-resource languages generally preserves performance. We conduct extensive evaluations across reasoning-intensive tasks (MMMLU, MATH-500) and non-reasoning benchmarks (CulturalBench, LMSYS-toxic), showing that the effect of…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01 · Rating 4 · Confidence 4

Strengths

1. The paper is well written.
2. The topic is timely and important, with potential real-world impact, especially for users of low-resource languages who are often overlooked in multilingual LLM research.
3. The authors test across eight diverse languages and several reasoning and behavioral benchmarks, which strengthens the generalizability of the findings.
4. The work provides valuable insights into the trade-off between reasoning performance and cultural or safety alignment

Weaknesses

1. While the study is well executed, prior research has already identified that multilingual models tend to reason in a dominant hub language (e.g., English) and display inconsistencies across languages. Earlier work also attributes this effect partly to decoding and translation issues (models may reason correctly in the hub language internally but fail to translate their reasoning back to the target language). This paper extends that paradigm to reasoning models, but the conceptual advance beyo

Reviewer 02 · Rating 4 · Confidence 4

Strengths

The paper focuses on an important question about multilingual reasoning traces. It also proposes a simple prefill method that reliably shifts reasoning language and enables controlled comparisons. The evaluation clearly shows that English-prefilled reasoning often boosts MATH-500/MMMLU, especially for lower-resource languages. In-depth analysis is conducted on scaling and difficulty, demonstrating that harder tasks enlarge the gap and favor English reasoning. However, in the cultural reasoning t

Weaknesses

A paper published at EMNLP 2025 [1] reports the same findings: LRMs are less accurate when reasoning in non-English languages than when reasoning in English on multilingual questions. That paper likewise introduces "prefix-hacking" (i.e., prefill) prompts to force the model to think in a specific language, for the benefit of users speaking different languages. It also includes a post-training exploration to mitigate the mismatch of reasoning languages when the prompt is in no

Reviewer 03 · Rating 2 · Confidence 5

Strengths

1. Multiple languages with a good distribution between low-resource and high-resource
2. Interesting findings on benchmark performance disparity
3. Use of open models
4. Correlation analysis of the results (not strictly needed as part of the main body, but still an excellent thing to see)

Weaknesses

This paper has interesting findings and methodology. In my opinion, however, it is not ready to be accepted at ICLR. I believe this to be true due to three major concerns:
1. The writing is not up to par. There are many typos, incomplete sentences, and misuse of \citet versus \citep. Anthropomorphism ('LRMs predominantly think') is rampant.
2. The rigour (both in terms of argumentation and experimentation) is not there. In terms of argumentation, there are several claims (e.g., L195-196; L258) w

Reviewer 04 · Rating 2 · Confidence 4

Strengths

- The paper reveals that despite language models' strong multilingual capabilities, they predominantly prefer to reason in hub languages like English. This provides a valuable empirical understanding of the models' reasoning traces and language biases during reasoning.
- The introduction of a text pre-filling method to steer the reasoning language demonstrates a concrete, effective strategy for improving or maintaining performance in multilingual reasoning tasks, while also highlighting the bene

Weaknesses

- The argument and logic in lines 78–85 are difficult to follow. It is unclear how using non-preferred versus preferred languages affects reasoning and non-reasoning tasks, and how the reported asymmetric effect arises.
- The classification of MMMLU as a reasoning benchmark seems questionable. While MMMLU includes some reasoning tasks, it also contains many knowledge-intensive questions (e.g., "In what year did the Continental Reformation begin?"). It would be better to clarify the rationale for
