Can LLMs Reason Abstractly Over Math Word Problems Without CoT? Disentangling Abstract Formulation From Arithmetic Computation

Ziling Cheng; Meng Cao; Leila Pishdad; Yanshuai Cao; Jackie Chi Kit Cheung

arXiv:2505.23701·cs.CL·May 30, 2025

TL;DR

This paper disentangles the reasoning sub-skills of abstract formulation and arithmetic computation in LLMs on math problems, revealing that computation is the main bottleneck and CoT mainly aids calculation rather than abstraction.

Contribution

It introduces a disentangled evaluation framework for LLM reasoning, demonstrating that abstract formulation and computation are distinct, conjunctively composed skills, with computation being the primary performance bottleneck.

Findings

01. Final-answer accuracy is bottlenecked by computation, not abstraction.

02. Chain-of-Thought mainly improves arithmetic computation, not abstraction.

03. Models first capture problem abstractions, then perform calculations.

Abstract

Final-answer-based metrics are commonly used for evaluating large language models (LLMs) on math word problems, often taken as proxies for reasoning ability. However, such metrics conflate two distinct sub-skills: abstract formulation (capturing mathematical relationships using expressions) and arithmetic computation (executing the calculations). Through a disentangled evaluation on GSM8K and SVAMP, we find that the final-answer accuracy of Llama-3 and Qwen2.5 (1B-32B) without CoT is overwhelmingly bottlenecked by the arithmetic computation step and not by the abstract formulation step. Contrary to the common belief, we show that CoT primarily aids in computation, with limited impact on abstract formulation. Mechanistically, we show that these two skills are composed conjunctively even in a single forward pass without any reasoning steps via an abstract-then-compute mechanism: models…
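To make the distinction between the two sub-skills concrete, here is a minimal Python sketch of how they might be scored separately. It is an illustration only, not the paper's evaluation protocol, which this summary does not describe. The helper names (`safe_eval`, `score_formulation`, `score_computation`), the tolerance, and the example problem are all hypothetical. The idea is that formulation is judged by evaluating the model's expression with a trusted evaluator, while computation is judged by comparing the model's number against the value of a known-correct expression.

```python
import ast
import operator

# Operators permitted in an arithmetic expression (hypothetical helper, not from the paper).
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.USub: operator.neg,
}


def safe_eval(expr: str) -> float:
    """Evaluate a plain arithmetic expression without calling eval()."""
    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.operand))
        raise ValueError(f"unsupported expression: {expr!r}")
    return walk(ast.parse(expr, mode="eval"))


def close(a: float, b: float, tol: float = 1e-6) -> bool:
    """Numeric equality up to an assumed tolerance."""
    return abs(a - b) <= tol


def score_formulation(predicted_expr: str, gold_answer: float) -> bool:
    """Abstract formulation: is the expression right, regardless of the model's arithmetic?"""
    try:
        return close(safe_eval(predicted_expr), gold_answer)
    except (ValueError, SyntaxError, ZeroDivisionError):
        return False


def score_computation(gold_expr: str, predicted_value: float) -> bool:
    """Arithmetic computation: given a correct expression, is the model's number right?"""
    return close(safe_eval(gold_expr), predicted_value)


# Made-up example: 48 apples, 3 bags of 7 given away, so the answer is 48 - 3 * 7 = 27.
# The model writes the right expression but miscalculates the result as 26.
formulation_ok = score_formulation("48 - 3 * 7", 27)
computation_ok = score_computation("48 - 3 * 7", 26)
print(formulation_ok, computation_ok)
```

In this made-up case a final-answer metric would mark the response wrong, even though only the computation step failed. That is the kind of conflation the abstract describes.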

Taxonomy

Methods: Activation Patching