SMART: Self-Generating and Self-Validating Multi-Dimensional Assessment for LLMs' Mathematical Problem Solving

Yujie Hou, Mei Wang, Yaoyao Zhong, Ting Zhang, Xuetao Ma, and Hua Huang

arXiv:2505.16646·cs.AI·April 21, 2026

TL;DR

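SMART is a benchmark inspired by Polya's problem-solving theory that evaluates LLMs across four cognitive dimensions of mathematical problem solving: Semantic Understanding, Mathematical Reasoning, Arithmetic Computation, and Reflection & Refinement. It reveals weaknesses in reasoning and reflection and proposes the All-Pass Score as a more comprehensive metric.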

Contribution

The paper introduces a multi-dimensional assessment framework for LLMs' mathematical reasoning. It addresses a limitation of existing evaluation methods, which typically focus either on the final answer or on the intermediate reasoning steps.

Findings

01  LLMs' capabilities differ substantially across the cognitive dimensions.

02  Current models show weaknesses in reasoning and reflection.

03  The All-Pass Score better captures true problem-solving ability.
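
This page does not define the All-Pass Score. As an illustration only, the sketch below assumes one natural reading of the name: a problem counts as solved only when the model passes the task for every one of the four dimensions. The dimension keys, the `results` records, and both function names are hypothetical and are not taken from the paper or its dataset.

```python
# Hypothetical sketch: per-dimension accuracy vs. an "all-pass" style score.
# ASSUMPTION: a problem counts only if all four dimension tasks are passed.
# The paper's actual definition may differ.

DIMENSIONS = ("understanding", "reasoning", "arithmetic", "reflection")


def per_dimension_accuracy(results):
    """Fraction of problems passed, computed separately for each dimension."""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    return {d: sum(r[d] for r in results) / n for d in DIMENSIONS}


def all_pass_score(results):
    """Fraction of problems for which every dimension task was passed."""
    if not results:
        raise ValueError("results must not be empty")
    passed_all = sum(all(r[d] for d in DIMENSIONS) for r in results)
    return passed_all / len(results)


# Made-up outcomes for four problems (True = dimension task passed).
results = [
    {"understanding": True, "reasoning": True, "arithmetic": True, "reflection": True},
    {"understanding": True, "reasoning": False, "arithmetic": True, "reflection": True},
    {"understanding": True, "reasoning": True, "arithmetic": True, "reflection": False},
    {"understanding": True, "reasoning": True, "arithmetic": False, "reflection": True},
]

print(per_dimension_accuracy(results))
print(all_pass_score(results))
```

With these made-up records, each dimension looks strong on its own (0.75 or higher), yet only one problem in four is passed on every dimension. Under the assumed definition, that gap is the kind of discrepancy a stricter joint metric would surface.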

Abstract

Large Language Models (LLMs) have achieved remarkable performance across a wide range of mathematical benchmarks. However, concerns remain as to whether these successes reflect genuine reasoning or superficial pattern recognition. Existing evaluation methods, which typically focus either on the final answer or on the intermediate reasoning steps, reduce mathematical reasoning to a shallow input-output mapping, overlooking its inherently multi-stage and multi-dimensional cognitive nature. Inspired by Polya's problem-solving theory, we propose SMART, a benchmark that decomposes mathematical problem-solving into four cognitive dimensions: Semantic Understanding, Mathematical Reasoning, Arithmetic Computation, and Reflection & Refinement, and introduces dimension-specific tasks to measure the corresponding cognitive processes of LLMs. We apply SMART to 22 state-of-the-art open- and…

Code & Models

Datasets

ewdfd/SMART
dataset · 57 dl
