Beyond Accuracy: Diagnosing Algebraic Reasoning Failures in LLMs Across Nine Complexity Dimensions
Parth Patil, Dhruv Kumar, Yash Sinha, Murari Mandal

TL;DR
This paper introduces a nine-dimension algebraic complexity framework for diagnosing specific failure modes in large language models, enabling detailed analysis of their algebraic reasoning capabilities.
Contribution
It presents a novel, automated, multi-dimensional framework for systematically varying algebraic problem complexity and diagnosing model failures across independent factors.
Findings
Working memory is the main bottleneck across models.
All models, regardless of size, fail between 20 and 30 parallel branches.
A subset of five dimensions suffices to diagnose algebraic failures.
Abstract
Algebraic reasoning remains one of the most informative stress tests for large language models, yet current benchmarks provide no mechanism for attributing failure to a specific cause. When a model fails an algebraic problem, a single accuracy score cannot reveal whether the expression was too deeply nested, the operator too uncommon, the intermediate state count too high, or the dependency chain too long. Prior work has studied individual failure modes in isolation, but no framework has varied each complexity factor independently under strict experimental control. No prior system has offered automatic generation and verification of problems of increasing complexity to track model progress over time. We introduce a nine-dimension algebraic complexity framework in which each factor is varied independently while all others are held fixed, with problem generation and verification handled…
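The one-factor-at-a-time design described in the abstract can be illustrated with a minimal sketch. This is not the authors' generator. The `Config` fields (`depth`, `branches`, `max_operand`) and the helpers `generate`, `sweep`, and `verify` are hypothetical names chosen for illustration, and they cover only three toy dimensions rather than the paper's nine. The sketch assumes integer arithmetic over `+`, `-`, and `*`, so the ground-truth answer can be computed exactly while each problem is built.

```python
import operator
import random
from dataclasses import dataclass, replace

# Hypothetical complexity dimensions; the paper's actual nine are not reproduced here.
@dataclass(frozen=True)
class Config:
    depth: int = 2        # nesting depth of the expression
    branches: int = 2     # sub-expressions combined at each level
    max_operand: int = 9  # largest integer allowed at a leaf

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def generate(cfg: Config, rng: random.Random) -> tuple[str, int]:
    """Build a fully parenthesized expression and its exact value together."""
    if cfg.depth == 0:
        n = rng.randint(1, cfg.max_operand)
        return str(n), n
    child = replace(cfg, depth=cfg.depth - 1)
    text, value = generate(child, rng)
    for _ in range(cfg.branches - 1):
        op = rng.choice("+-*")
        rhs_text, rhs_value = generate(child, rng)
        text = f"({text} {op} {rhs_text})"
        value = OPS[op](value, rhs_value)
    return text, value

def sweep(base: Config, dimension: str, values, seed: int = 0):
    """Vary one dimension while every other field keeps its base value."""
    rng = random.Random(seed)
    problems = []
    for v in values:
        cfg = replace(base, **{dimension: v})
        text, answer = generate(cfg, rng)
        problems.append((cfg, text, answer))
    return problems

def verify(model_output: str, answer: int) -> bool:
    """Automatic check: the model's reply must parse to the exact integer answer."""
    try:
        return int(model_output.strip()) == answer
    except ValueError:
        return False

if __name__ == "__main__":
    for cfg, text, answer in sweep(Config(), "depth", [1, 2, 3]):
        print(cfg.depth, text, "=", answer)
```

Because the answer is computed alongside the expression, verification needs no human labeling, and a drop in accuracy along one sweep can be attributed to the single dimension that changed.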
