Linguistic Generalizability of Test-Time Scaling in Mathematical Reasoning

Guijin Son; Jiwoo Hong; Hyunwoo Ko; James Thorne

arXiv:2502.17407·cs.CL·August 4, 2025


Open Access · 1 Repo · 2 Datasets

TL;DR

This paper evaluates the effectiveness of test-time scaling methods on multilingual mathematical reasoning, revealing limited cross-language generalization and introducing a new benchmark for future research.

Contribution

It introduces MCLM, a multilingual math benchmark, and systematically assesses test-time scaling methods, highlighting their limited generalization across languages.

Findings

01

"Thinking LLMs" perform comparably to traditional scaling methods such as best-of-N once constrained to similar inference FLOPs.

02

Budget Forcing (BF) improves English AIME scores by 20 points but yields only a 1.94-point average gain across other languages.

03

Test-time scaling shows limited effectiveness in multilingual mathematical reasoning.

Abstract

Scaling pre-training compute has proven effective for achieving multilinguality, but does the same hold for test-time scaling? In this work, we introduce MCLM, a multilingual math benchmark featuring competition-level problems in 55 languages. We test three test-time scaling methods, Outcome Reward Modeling (ORM), Process Reward Modeling (PRM), and Budget Forcing (BF), on both Qwen2.5-1.5B Math and MR1-1.5B, a multilingual LLM we trained for extended reasoning. Our experiments show that using Qwen2.5-1.5B Math with ORM achieves a score of 35.8 on MCLM, while BF on MR1-1.5B attains 35.2. Although "thinking LLMs" have recently garnered significant attention, we find that their performance is comparable to traditional scaling methods like best-of-N once constrained to similar levels of inference FLOPs. Moreover, while BF yields a 20-point improvement on English AIME, it provides only a…
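For readers unfamiliar with best-of-N, the traditional scaling baseline named in the abstract, the following is a minimal sketch of the general idea: sample N candidate solutions and keep the one an outcome-level scorer ranks highest. It is not the authors' implementation. The names `best_of_n`, `make_toy_generator`, and `toy_score` are hypothetical, and the toy generator and scorer are stand-ins for an LLM sampler and a trained outcome reward model.

```python
from typing import Callable, List, Sequence


def best_of_n(
    question: str,
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    n: int = 4,
) -> str:
    """Sample n candidates and return the one with the highest score."""
    if n < 1:
        raise ValueError("n must be at least 1")
    candidates: List[str] = [generate(question) for _ in range(n)]
    # max() keeps the first maximal element, so ties go to the earliest sample.
    return max(candidates, key=lambda c: score(question, c))


def make_toy_generator(answers: Sequence[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for an LLM sampler: replays fixed answers in order."""
    it = iter(answers)

    def generate(question: str) -> str:
        return next(it)

    return generate


def toy_score(question: str, candidate: str) -> float:
    """Hypothetical stand-in for an outcome reward model.

    Assumption for illustration only: a candidate ending in "4" is "correct".
    """
    return 1.0 if candidate.strip().endswith("4") else 0.0


picked = best_of_n(
    "Solve x + 1 = 5.",
    make_toy_generator(["x = 5", "x = 4", "x = 3"]),
    toy_score,
    n=3,
)
print(picked)  # x = 4
```

In this sketch, inference cost grows roughly linearly with `n`, since each candidate is a separate generation plus one scoring call. This is why comparisons against longer single-chain reasoning are made at matched inference FLOPs.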

Code & Models

Repositories

gauss5930/mclm (Official)

Taxonomy

Topics: Intelligent Tutoring Systems and Adaptive Learning · Educational Technology and Assessment