Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning

Xia Yang; Xuanyi Zhang; Hao Hu; Feng Ji

arXiv:2605.09292 · cs.AI · May 12, 2026

TL;DR

This paper introduces a framework to evaluate the diversity of reasoning strategies in large language models on math problems, revealing a gap between accuracy and reasoning flexibility.

Contribution

The paper presents a strategy-level evaluation method for LLMs and shows that models produce a less diverse set of reasoning strategies than the human reference set.

Findings

01. Under multiple-strategy prompts, models recover fewer strategies than the human reference set.

02. The four models generate between 110 and 184 distinct valid strategies each, yet still miss many human reference strategies.

03. Repeated runs yield diminishing returns in strategy discovery.

Abstract

Large language models now achieve high final-answer accuracy on mathematical reasoning benchmarks, but accuracy alone does not capture reasoning flexibility. We introduce a strategy-level evaluation framework instantiated on 80 AMC 10/12 and AIME problems with 217 AoPS-derived reference strategy families. Model outputs are annotated for strategy identity, validity, and correctness using dual-AI coding with human adjudication. Across four frontier models, we find a pronounced decoupling between answer accuracy and strategy diversity. Under a single-solution prompt, all models achieve high accuracy (95%-100%), but under a multiple-strategy prompt they recover substantially fewer strategies than the human reference set. Gemini, DeepSeek, GPT, and Claude generate 184, 152, 151, and 110 distinct valid strategies, respectively, with the largest gaps in Geometry and Number Theory. The models…
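To make the distinction between answer accuracy and strategy diversity concrete, here is a minimal sketch of two quantities in the spirit of the abstract: the share of reference strategy families a model recovers, and the number of distinct strategies accumulated over repeated runs. The function names, the strategy labels, and the toy data are all hypothetical, and the excerpt does not state the paper's exact metric definitions, so treat this as an illustration rather than the authors' method.

```python
from typing import Dict, List, Set


def strategy_coverage(
    reference: Dict[str, Set[str]], recovered: Dict[str, Set[str]]
) -> float:
    """Fraction of reference strategy families matched by the model.

    Both arguments map a problem id to a set of strategy labels.
    Model strategies outside the reference set do not count toward coverage.
    """
    total = sum(len(families) for families in reference.values())
    hits = sum(
        len(families & recovered.get(problem_id, set()))
        for problem_id, families in reference.items()
    )
    return hits / total if total else 0.0


def cumulative_discovery(runs: List[Set[str]]) -> List[int]:
    """Count of distinct strategies seen after each successive run."""
    seen: Set[str] = set()
    counts: List[int] = []
    for run in runs:
        seen |= run  # add any strategies not seen in earlier runs
        counts.append(len(seen))
    return counts


# Hypothetical toy data: two problems with five reference families in total.
reference = {
    "p1": {"coordinates", "similar_triangles", "trig"},
    "p2": {"mod_arith", "casework"},
}
recovered = {
    "p1": {"coordinates", "trig", "vectors"},  # "vectors" is not in the reference
    "p2": {"casework"},
}
coverage = strategy_coverage(reference, recovered)

# Three hypothetical repeated runs on one problem.
runs = [{"coordinates", "trig"}, {"coordinates", "vectors"}, {"trig"}]
curve = cumulative_discovery(runs)
```

In this toy setup the model matches three of the five reference families, and the discovery curve flattens once later runs only repeat strategies already seen, which is the shape that diminishing returns would take.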
