LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning

Shradha Agarwal; Deepak Rajbhar; Tariq J

arXiv:2605.16675·cs.AI·May 19, 2026

TL;DR

LinAlg-Bench is a comprehensive diagnostic benchmark that evaluates large language models on structured linear algebra tasks, revealing a scale-dependent shift from execution errors to fabrication and abandonment failures.

Contribution

The paper introduces LinAlg-Bench, a detailed forensic framework for analyzing LLM failures in linear algebra, uncovering a universal transition at 4x4 matrix scale and new structured hallucination modes.

Findings

01. Failure modes are structurally constrained by matrix size and algorithm type.

02. A sharp behavioral threshold at 4x4 matrices marks a transition from execution errors to fabrication.

03. Solution strategy rigidity predicts 5x5 determinant accuracy with high precision.

Abstract

We introduce LinAlg-Bench, a diagnostic benchmark evaluating 10 frontier large language models on structured linear algebra computation across a strict dimensional gradient of 3x3, 4x4, and 5x5 matrices. Spanning 9 task types and 660 SymPy-certified problems, the benchmark exhaustively evaluates 6,600 model outputs. Beyond binary accuracy, LinAlg-Bench introduces a three-stage automated forensic pipeline classifying 1,156 failures into ten primary error tags with fine-grained subtypes, revealing that LLM mathematical failure is not random but structurally constrained by algorithm type and matrix dimension. Our central finding is a sharp behavioral threshold at 4x4 scale: below it, models fail through execution errors -- sign tracking failures, arithmetic drift, and parity errors; above it, failure transitions to computational abandonment, with models fabricating responses through tool…
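To make the idea of a SymPy-certified problem concrete, the following is a minimal sketch of how one determinant item per matrix size might be generated and graded. It is not the authors' pipeline: the function names `make_det_problem` and `is_correct`, the entry range, the seeding scheme, and the choice to cross-check two SymPy determinant algorithms are all assumptions made for illustration.

```python
import random

from sympy import Integer, Matrix


def make_det_problem(n, seed, lo=-9, hi=9):
    """Build a random n x n integer matrix and an exact reference determinant.

    Hypothetical helper: the entry range and seeding are illustrative only.
    """
    rng = random.Random(seed)
    matrix = Matrix([[rng.randint(lo, hi) for _ in range(n)] for _ in range(n)])
    # Compute the determinant with two different exact algorithms and require
    # agreement before treating the value as the certified reference.
    reference = matrix.det()
    assert reference == matrix.det(method="berkowitz")
    return matrix, reference


def is_correct(model_answer, reference):
    """Binary grading: parse the model's final answer as an integer and compare."""
    try:
        return Integer(int(model_answer.strip())) == reference
    except ValueError:
        # Non-numeric output counts as a failure at this stage.
        return False


if __name__ == "__main__":
    # One item per size in the 3x3, 4x4, 5x5 gradient described in the abstract.
    for n in (3, 4, 5):
        matrix, reference = make_det_problem(n, seed=n)
        print(n, reference, is_correct(str(reference), reference))
```

Exact integer arithmetic is the point of the design: a floating-point reference could disagree with a correct model answer by rounding alone, which would blur the binary accuracy signal that any downstream error classification depends on.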
