LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning
Shradha Agarwal, Deepak Rajbhar, Tariq J

TL;DR
LinAlg-Bench is a diagnostic benchmark that evaluates 10 frontier large language models on structured linear algebra tasks over 3x3, 4x4, and 5x5 matrices, revealing a shift at 4x4 scale from execution errors to fabrication and computational abandonment.
Contribution
The paper introduces LinAlg-Bench, a detailed forensic framework for analyzing LLM failures in linear algebra, uncovering a universal transition at 4x4 matrix scale and new structured hallucination modes.
Findings
Failure modes are structurally constrained by matrix size and algorithm type.
A sharp behavioral threshold at 4x4 matrices marks a transition from execution errors to fabrication.
Solution strategy rigidity predicts 5x5 determinant accuracy with high precision.
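To make the link between matrix size and execution errors concrete, here is a minimal sketch of determinant computation by cofactor expansion along the first row. It is an illustration only: the summary does not say which algorithm any model used, and the function names (`minor`, `det_cofactor`, `signed_term_count`) are hypothetical. Each cofactor carries an alternating sign, and the number of signed products in the fully expanded determinant grows as n!, which is one plausible reason sign tracking gets harder from 3x3 to 5x5.

```python
import math


def minor(m, row, col):
    """Return m with the given row and column removed."""
    return [r[:col] + r[col + 1:] for i, r in enumerate(m) if i != row]


def det_cofactor(m):
    """Determinant by cofactor expansion along the first row."""
    n = len(m)
    if n == 1:
        return m[0][0]
    total = 0
    for j in range(n):
        # Cofactor sign (-1)^(0 + j): a single slip here flips a whole term.
        sign = -1 if j % 2 else 1
        total += sign * m[0][j] * det_cofactor(minor(m, 0, j))
    return total


def signed_term_count(n):
    """Number of signed products in the full expansion of an n x n determinant."""
    return math.factorial(n)


if __name__ == "__main__":
    print(det_cofactor([[1, 2, 3], [4, 5, 6], [7, 8, 10]]))
    # 3! = 6, 4! = 24, 5! = 120 signed products to keep track of.
    print([signed_term_count(n) for n in (3, 4, 5)])
```

The recursive form is chosen for readability rather than speed; elimination-based methods do far less work but involve different bookkeeping (row swaps also flip the sign).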
Abstract
We introduce LinAlg-Bench, a diagnostic benchmark evaluating 10 frontier large language models on structured linear algebra computation across a strict dimensional gradient of 3x3, 4x4, and 5x5 matrices. Spanning 9 task types and 660 SymPy-certified problems, the benchmark exhaustively evaluates 6,600 model outputs. Beyond binary accuracy, LinAlg-Bench introduces a three-stage automated forensic pipeline classifying 1,156 failures into ten primary error tags with fine-grained subtypes, revealing that LLM mathematical failure is not random but structurally constrained by algorithm type and matrix dimension. Our central finding is a sharp behavioral threshold at 4x4 scale: below it, models fail through execution errors -- sign tracking failures, arithmetic drift, and parity errors; above it, failure transitions to computational abandonment, with models fabricating responses through tool…
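As a rough sketch of what "SymPy-certified" grading could look like, the snippet below computes an exact determinant with SymPy and compares it with a model's answer given as a string. This is an assumption about the general approach, not the paper's actual pipeline: the helper names `certified_det` and `is_correct`, the example matrix, and the string answer format are all hypothetical.

```python
from fractions import Fraction

from sympy import Matrix, Rational


def certified_det(rows):
    """Exact reference determinant computed by SymPy."""
    return Matrix(rows).det()


def is_correct(model_answer, rows):
    """Compare a model's answer string with the exact reference value."""
    try:
        parsed = Fraction(model_answer.strip())
    except ValueError:
        # Non-numeric output (for example a refusal) counts as a failure.
        return False
    return Rational(parsed.numerator, parsed.denominator) == certified_det(rows)


if __name__ == "__main__":
    # Hypothetical 4x4 upper-triangular example: det is the diagonal product.
    problem = [
        [1, 2, 0, 1],
        [0, 3, 1, 0],
        [0, 0, 2, 5],
        [0, 0, 0, 4],
    ]
    print(is_correct("24", problem))    # exact match
    print(is_correct("-24", problem))   # a sign error
    print(is_correct("n/a", problem))   # abandonment-style output
```

Exact rational comparison avoids floating-point tolerance choices; binary grading like this would only be the first step, since the benchmark's forensic pipeline goes on to classify why a wrong answer is wrong.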
