From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics

Bowen Cao; Dongdong Zhang; Yixia Li; Junpeng Liu; Shijue Huang; Chufan Shi; Hongyuan Lu; Yaokang Wu; Guanhua Chen; Wai Lam; Furu Wei

arXiv:2601.23048 · cs.AI · April 6, 2026

TL;DR

This paper investigates the limitations of large language models in contextual mathematical reasoning, introducing a new benchmark and analyzing how formulation and reasoning bottlenecks affect performance.

Contribution

The paper presents ContextMATH, a benchmark that repurposes AIME and MATH-500 problems into two contextual settings, Scenario Grounding (SG) and Complexity Scaling (CS), and analyzes how problem formulation and reasoning each limit LLM performance on them.

Findings

01

Open-source models drop by an average of 13 points on SG and 34 points on CS; proprietary models drop by 13 and 20.

02

Problem formulation errors are the main cause of failures.

03

Larger models improve understanding and reasoning but still face bottlenecks.

Abstract

Large language models now solve many benchmark math problems at near-expert levels, yet this progress has not fully translated into reliable performance in real-world applications. We study this gap through contextual mathematical reasoning, where the mathematical core must be formulated from descriptive scenarios. We introduce ContextMATH, a benchmark that repurposes AIME and MATH-500 problems into two contextual settings: Scenario Grounding (SG), which embeds abstract problems into realistic narratives without increasing reasoning complexity, and Complexity Scaling (CS), which transforms explicit conditions into sub-problems to capture how constraints often appear in practice. Evaluating 61 proprietary and open-source models, we observe sharp drops: on average, open-source models decline by 13 and 34 points on SG and CS, while proprietary models drop by 13 and 20. Error analysis shows…
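As a minimal sketch of how the "average drop in points" reported above could be computed, the snippet below takes the mean difference between each model's accuracy on the original problems and its accuracy on a contextual setting. The function name `average_drop`, the record layout, and all numbers are assumptions made for illustration; they are placeholder values, not results from the paper.

```python
from statistics import mean


def average_drop(scores, setting):
    """Mean drop in accuracy points from the original problems to `setting`."""
    return mean(s["original"] - s[setting] for s in scores)


# Placeholder accuracies (percent) for two hypothetical models.
# These are NOT numbers from the paper.
scores = [
    {"original": 80.0, "SG": 72.0, "CS": 60.0},
    {"original": 60.0, "SG": 50.0, "CS": 30.0},
]

sg_drop = average_drop(scores, "SG")
cs_drop = average_drop(scores, "CS")
print(f"SG drop: {sg_drop:.1f} points, CS drop: {cs_drop:.1f} points")
```

Averaging per-model differences, rather than differencing pooled averages, keeps each model's original-versus-contextual comparison paired; with the same set of models in both conditions the two give the same value.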

Code & Models

Datasets

bwcao/ContextMATH
dataset· 28 dl

Videos

From Abstract to Contextual: What LLMs Still Cannot Do in Mathematics· slideslive