Long Chain-of-Thought Reasoning Across Languages
Josh Barua, Seun Eisape, Kayo Yin, Alane Suhr

TL;DR
This paper investigates how long chain-of-thought reasoning abilities in large models transfer from English to multiple other languages, analyzing development stages, data strategies, and language-specific challenges.
Contribution
It systematically studies multilingual long CoT reasoning, compares different training and inference methods, and proposes synthetic data approaches to improve non-English reasoning capabilities.
Findings
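Scaling reasoning model size improves multilingual task performance when models reason in English (En-CoT), but performance with target-language reasoning (Target-CoT) lags behind.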
Pretraining with specialized reasoning enhances English CoT but may harm target languages.
Fine-tuning on translated synthetic reasoning traces improves non-English reasoning and can substitute for large English datasets.
Abstract
While large reasoning models have shown remarkable ability to generate long chains-of-thought (CoTs) in English, we still lack understanding of how these long-form reasoning abilities transfer to the vast majority of the world's languages. In this work, we systematically investigate four key stages of model development--scaling, pretraining, post-training, and inference--to understand how long CoT capabilities extend beyond English. We compare two reasoning settings across nine non-English target languages: En-CoT, where models process target-language inputs, but reason in English; and Target-CoT, where models both process inputs and generate long CoTs in the target language. We find that scaling reasoning model size improves multilingual task performance in En-CoT, but Target-CoT performance lags behind. This gap widens for tasks requiring long, multi-step CoTs such as mathematical…
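The two reasoning settings described above differ only in the language the model is asked to reason in, which can be sketched as a small prompt builder. This is a minimal illustration under stated assumptions, not the paper's method: the instruction wording, the `ReasoningSetting` type, the `build_prompt` helper, and the example language codes are all hypothetical, and the source does not give the actual prompts or name the nine target languages.

```python
from dataclasses import dataclass

# Illustrative placeholder languages only; the nine target languages
# evaluated in the paper are not listed in this summary.
LANGUAGE_NAMES = {"en": "English", "ja": "Japanese", "sw": "Swahili"}


@dataclass(frozen=True)
class ReasoningSetting:
    """Hypothetical container for one of the two settings in the abstract."""
    name: str
    reason_in_target: bool


# En-CoT: target-language input, English reasoning.
EN_COT = ReasoningSetting("En-CoT", reason_in_target=False)
# Target-CoT: target-language input, target-language reasoning.
TARGET_COT = ReasoningSetting("Target-CoT", reason_in_target=True)


def build_prompt(question: str, target_lang: str, setting: ReasoningSetting) -> str:
    """Append a reasoning-language instruction to a target-language question.

    The instruction text is an assumption made for illustration.
    """
    reasoning_lang = target_lang if setting.reason_in_target else "en"
    language_name = LANGUAGE_NAMES[reasoning_lang]
    return (
        f"{question}\n\n"
        f"Think step by step in {language_name} before giving the final answer."
    )


if __name__ == "__main__":
    question = "2 + 2 は何ですか?"  # a target-language (Japanese) input
    for setting in (EN_COT, TARGET_COT):
        print(f"[{setting.name}]")
        print(build_prompt(question, "ja", setting))
        print()
```

In both settings the question itself stays in the target language; only the requested reasoning language changes, which is the distinction the abstract draws between En-CoT and Target-CoT.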
Peer Reviews
Decision: ICLR 2026 Poster
The paper is well-written and framed clearly with three setups and a comprehensive evaluation of nine languages, covering high/middle/low resource languages. Also, the scaling study is carefully controlled and highlights that Target-CoT never reaches English-reasoning levels, even at 32B; switching to target-language reasoning at 32B still performs lower than a 7B English baseline. Besides, the post-training section is practical. It shows that with only ~1k target-language traces, translated f
Although the authors benchmark translators and justify the use of Gemini-2.0-Flash in Appendix B.5, it would still be worthwhile to measure the quality of the translated datasets with existing translation quality estimation metrics (e.g., xCOMET, MetricX). These scores would directly show whether the translated datasets are reliable and trustworthy. The evaluated models cover only one model family, the Qwen series. Although Deepseek-Distilled-R1 is trained mainly on English an
1. This paper is the first to investigate long CoT in LLMs across languages.
2. The experiments offer some insights for future improvement.
1. The quality of the experiments remains to be improved.
2. Some experiment settings are strange.
3. The analyses could be more thorough.
For all of these, refer to the questions below.
- The paper tackles a novel and meaningful question, namely how long CoT reasoning transfers across languages, addressing a major gap in multilingual LLM research.
- Comprehensive evaluation across nine languages and multiple resource levels offers strong empirical grounding.
- The finding that translated synthetic data can substitute for large English datasets is practical and impactful for multilingual model training.
- The study relies solely on Qwen-family models (Qwen2.5, Qwen2.5-Math, Qwen3), limiting generalizability; results might be model-specific rather than universal.
- Although reports describe these Qwen models’ data composition, the exact pretraining and fine-tuning details remain opaque; using them as representative backbones may reduce methodological rigor.
Taxonomy
Topics: Semantic Web and Ontologies · Natural Language Processing Techniques
