Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning
Guijin Son, Jiwoo Hong, Hyunwoo Ko, James Thorne

TL;DR
This paper evaluates the effectiveness of test-time scaling methods on multilingual mathematical reasoning, revealing limited cross-language generalization and introducing a new benchmark for future research.
Contribution
It introduces MCLM, a multilingual math benchmark, and systematically assesses test-time scaling methods, highlighting their limited generalization across languages.
Findings
Under similar inference FLOPs, "thinking LLMs" perform comparably to traditional scaling methods such as best-of-N.
Budget Forcing (BF) improves English AIME scores by 20 points but yields only 1.94 points on average across other languages.
Test-time scaling shows limited effectiveness in multilingual mathematical reasoning.
Abstract
Scaling pre-training compute has proven effective for achieving multilinguality, but does the same hold for test-time scaling? In this work, we introduce MCLM, a multilingual math benchmark featuring competition-level problems in 55 languages. We test three test-time scaling methods, Outcome Reward Modeling (ORM), Process Reward Modeling (PRM), and Budget Forcing (BF), on both Qwen2.5-1.5B Math and MR1-1.5B, a multilingual LLM we trained for extended reasoning. Our experiments show that using Qwen2.5-1.5B Math with ORM achieves a score of 35.8 on MCLM, while BF on MR1-1.5B attains 35.2. Although "thinking LLMs" have recently garnered significant attention, we find that their performance is comparable to traditional scaling methods like best-of-N once constrained to similar levels of inference FLOPs. Moreover, while BF yields a 20-point improvement on English AIME, it provides only a…
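For readers unfamiliar with the best-of-N baseline named in the abstract, the following is a minimal sketch of best-of-N selection guided by an outcome reward score. It is not the authors' implementation: the function names (`best_of_n`, `toy_generate`, `toy_score`) and the toy generator and scorer are hypothetical stand-ins for a language model and a reward model.

```python
from typing import Callable, Tuple


def best_of_n(
    question: str,
    generate: Callable[[str, int], str],
    score: Callable[[str, str], float],
    n: int = 4,
) -> Tuple[str, float]:
    """Sample n candidate answers and return the highest-scoring one.

    `generate` and `score` are assumed interfaces: a sampler that takes a
    question and a seed, and an outcome scorer that rates a full answer.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Draw n candidates, one per seed.
    candidates = [generate(question, seed) for seed in range(n)]
    # Score each complete answer (outcome-level, not step-level).
    scored = [(score(question, answer), answer) for answer in candidates]
    # Keep the candidate with the highest score.
    best_score, best_answer = max(scored, key=lambda pair: pair[0])
    return best_answer, best_score


def toy_generate(question: str, seed: int) -> str:
    # Hypothetical stand-in for sampling from an LLM: deterministic per seed.
    return str(seed * 3 % 5)


def toy_score(question: str, answer: str) -> float:
    # Hypothetical stand-in for a reward model: prefers answers close to 3.
    return -float(abs(int(answer) - 3))


answer, reward = best_of_n("toy question", toy_generate, toy_score, n=4)
```

Increasing `n` raises inference cost roughly linearly, which is why comparisons against other test-time scaling methods are usually made at matched inference FLOPs.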
Taxonomy
Topics: Intelligent Tutoring Systems and Adaptive Learning · Educational Technology and Assessment
