From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics
Bowen Cao, Dongdong Zhang, Yixia Li, Junpeng Liu, Shijue Huang, Chufan Shi, Hongyuan Lu, Yaokang Wu, Guanhua Chen, Wai Lam, Furu Wei

TL;DR
This paper investigates the limitations of large language models in contextual mathematical reasoning, introducing a new benchmark and analyzing how formulation and reasoning bottlenecks affect performance.
Contribution
The paper presents ContextMATH, a benchmark that repurposes AIME and MATH-500 problems into contextual math problems, and analyzes how problem formulation and reasoning each limit LLM performance.
Findings
Open-source models drop on average by 13 points on Scenario Grounding (SG) and 34 points on Complexity Scaling (CS); proprietary models drop by 13 and 20.
Problem formulation errors are the main cause of failures.
Larger models improve understanding and reasoning but still face bottlenecks.
Abstract
Large language models now solve many benchmark math problems at near-expert levels, yet this progress has not fully translated into reliable performance in real-world applications. We study this gap through contextual mathematical reasoning, where the mathematical core must be formulated from descriptive scenarios. We introduce ContextMATH, a benchmark that repurposes AIME and MATH-500 problems into two contextual settings: Scenario Grounding (SG), which embeds abstract problems into realistic narratives without increasing reasoning complexity, and Complexity Scaling (CS), which transforms explicit conditions into sub-problems to capture how constraints often appear in practice. Evaluating 61 proprietary and open-source models, we observe sharp drops: on average, open-source models decline by 13 and 34 points on SG and CS, while proprietary models drop by 13 and 20. Error analysis shows…
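The average point drops quoted in the abstract can be illustrated with a small sketch. It assumes the drop is computed per model as original accuracy minus contextual accuracy and then averaged within each model group; the abstract does not spell out the aggregation, so this is an assumption. The model names, the `average_drop` helper, and all accuracy values are hypothetical and are not results from the paper.

```python
from statistics import mean

# Hypothetical accuracies in percent, for illustration only.
# These are NOT numbers reported in the paper.
scores = {
    "model_a": {"group": "open", "orig": 60.0, "sg": 50.0, "cs": 30.0},
    "model_b": {"group": "open", "orig": 40.0, "sg": 24.0, "cs": 10.0},
    "model_c": {"group": "proprietary", "orig": 80.0, "sg": 70.0, "cs": 60.0},
}


def average_drop(scores, group, setting):
    """Mean of (original - contextual) accuracy over the models in a group.

    Assumes a per-model difference averaged across models; the paper's
    exact aggregation may differ.
    """
    drops = [
        m["orig"] - m[setting]
        for m in scores.values()
        if m["group"] == group
    ]
    return mean(drops)


for group in ("open", "proprietary"):
    for setting in ("sg", "cs"):
        print(group, setting, average_drop(scores, group, setting))
```

Under this reading, a group's figure is a mean of per-model gaps, so a few models with very large drops can pull the average up even when most models degrade only slightly.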
