SenseMath: Do LLMs Have Number Sense? Evaluating Shortcut Use, Judgment, and Generation
Haomin Zhuang, Xiangqi Wang, Yili Shen, Ying Cheng, Xiangliang Zhang

TL;DR
SenseMath is a benchmark for evaluating whether large language models recognize numerical structure and apply shortcuts appropriately. It shows that models often overgeneralize shortcuts and fall short of human-like number sense.
Contribution
This work provides a controlled benchmark with three evaluation settings (Shortcut Use, Applicability Judgment, and Problem Generation) for assessing LLMs' numerical reasoning, and it highlights their limitations in structural understanding and context-aware shortcut use.
Findings
Models adopt shortcuts when prompted, improving accuracy by up to 15%.
Under standard prompting, models use shortcuts in fewer than 40% of cases.
Models overgeneralize shortcuts and cannot generate valid shortcut problems from scratch.
Abstract
Large language models often default to step-by-step computation even when efficient numerical shortcuts are available. This raises a basic question: do they exhibit number sense in a human-like behavioral sense, i.e., the ability to recognize numerical structure, apply shortcuts when they are appropriate, and avoid them when they are not? We introduce SenseMath, a controlled benchmark for evaluating structure-sensitive numerical reasoning in LLMs. SenseMath contains 4,800 items spanning eight shortcut categories and four digit scales, with matched strong-shortcut, weak-shortcut, and control variants. It supports three evaluation settings of increasing cognitive demand: Shortcut Use (whether models can apply shortcuts on shortcut-amenable problems); Applicability Judgment (whether they can recognize when a shortcut is appropriate or misleading); and Problem Generation (whether they can generate…
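To make the idea of a numerical shortcut concrete, here is a minimal Python sketch of one possible shortcut: multiplying by a number close to a round base. It is an illustration only. The function names, the choice of base 100, and the gap thresholds used to label an operand "strong", "weak", or "control" are all hypothetical and are not taken from SenseMath, whose actual categories and construction rules are not given in the text above.

```python
def multiply_near_round(a: int, b: int, base: int = 100) -> int:
    """Compute a * b by rewriting a as (base - delta).

    Example: 98 * 47 = 100 * 47 - 2 * 47.
    """
    delta = base - a
    return base * b - delta * b


def shortcut_strength(a: int, base: int = 100) -> str:
    """Hypothetical labelling rule (NOT the benchmark's): the closer `a`
    is to the round base, the more attractive the shortcut."""
    gap = abs(base - a)
    if gap <= 2:
        return "strong"
    if gap <= 10:
        return "weak"
    return "control"


if __name__ == "__main__":
    for a in (98, 93, 63):
        product = multiply_near_round(a, 47)
        # The shortcut always gives the right answer; what varies is
        # whether it saves any work compared with direct multiplication.
        assert product == a * 47
        print(a, shortcut_strength(a), product)
```

The rewrite is arithmetically valid for every operand, so correctness is never what separates the three labels. What changes is how much effort the shortcut saves, which is the kind of judgment the Applicability Judgment setting is described as probing.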
