Beyond Accuracy: Evaluating Strategy Diversity in LLM Mathematical Reasoning
Xia Yang, Xuanyi Zhang, Hao Hu, Feng Ji

TL;DR
This paper introduces a framework to evaluate the diversity of reasoning strategies in large language models on math problems, revealing a gap between accuracy and reasoning flexibility.
Contribution
It presents a novel strategy-level evaluation method for LLMs, highlighting the limited diversity of reasoning strategies compared to human references.
Findings
Under a multiple-strategy prompt, models recover substantially fewer strategies than the human reference set contains.
Each model generates between 110 and 184 distinct valid strategies, yet still misses many human reference strategies.
Repeated runs yield diminishing returns in strategy discovery.
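The two quantities behind these findings, coverage of the human reference set and the number of new strategies found per additional run, can be illustrated with a small sketch. This is not the paper's implementation: the function names, the strategy labels, and the example data are all hypothetical, and the sketch assumes each run has already been annotated and filtered down to a set of valid strategy-family labels.

```python
from typing import List, Set


def coverage(reference: Set[str], found: Set[str]) -> float:
    """Fraction of reference strategy families that appear in `found`."""
    if not reference:
        return 0.0
    return len(reference & found) / len(reference)


def discovery_curve(runs: List[Set[str]]) -> List[int]:
    """Cumulative number of distinct strategies seen after each run."""
    seen: Set[str] = set()
    curve: List[int] = []
    for run in runs:
        seen |= run
        curve.append(len(seen))
    return curve


def marginal_gains(curve: List[int]) -> List[int]:
    """New strategies contributed by each run (first difference of the curve)."""
    return [after - before for before, after in zip([0] + curve[:-1], curve)]


# Hypothetical annotations for one geometry problem (labels are made up).
reference = {"coordinates", "similar_triangles", "trigonometry", "mass_points"}
runs = [
    {"coordinates", "trigonometry"},
    {"coordinates", "similar_triangles"},
    {"trigonometry", "coordinates"},
]

curve = discovery_curve(runs)                   # [2, 3, 3]
gains = marginal_gains(curve)                   # [2, 1, 0]
cov = coverage(reference, set().union(*runs))   # 3 of 4 families -> 0.75
```

In this toy data the marginal gains shrink from 2 to 1 to 0 while one reference family is never recovered, which is the shape of result the findings describe: more runs add fewer new strategies, and coverage stays below the human reference set.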
Abstract
Large language models now achieve high final-answer accuracy on mathematical reasoning benchmarks, but accuracy alone does not capture reasoning flexibility. We introduce a strategy-level evaluation framework instantiated on 80 AMC 10/12 and AIME problems with 217 AoPS-derived reference strategy families. Model outputs are annotated for strategy identity, validity, and correctness using dual-AI coding with human adjudication. Across four frontier models, we find a pronounced decoupling between answer accuracy and strategy diversity. Under a single-solution prompt, all models achieve high accuracy (95%-100%), but under a multiple-strategy prompt they recover substantially fewer strategies than the human reference set. Gemini, DeepSeek, GPT, and Claude generate 184, 152, 151, and 110 distinct valid strategies, respectively, with the largest gaps in Geometry and Number Theory. The models…
