Probing Structural Mathematical Reasoning in Language Models with Algebraic Trapdoors
Igor Rivin

TL;DR
This paper introduces a benchmark suite for assessing structural mathematical reasoning in language models, focusing on subgroup problems in SL(3, Z) with cryptographic-style verification, to distinguish models with algebraic priors from those relying on general computation.
Contribution
The paper presents a benchmark for evaluating algebraic reasoning in language models, testing how they handle subgroup problems in SL(3, Z) under cryptographic-style verification.
Findings
One model identified the membership question as the bottleneck.
Models demonstrated calibrated meta-cognition by abstaining rather than guessing.
The benchmark reveals a four-way classification of model behavior.
Abstract
We introduce a benchmark suite for evaluating structural mathematical reasoning in language models, built on subgroup-construction problems in SL(3, Z) with cryptographic-style verifier-prover asymmetry. Each instance presents a finitely generated subgroup as a list of integer matrices and asks for an arithmetic invariant -- index, surjection-at-prime, or membership -- that the construction-time information (N, K) pins down in O(1) closed form, but that the solver, lacking that information, must derive either by Aschbacher-classification analysis or by a membership query in SL(3, Z) of unknown decidability. The benchmark therefore distinguishes models with internalized algebraic priors (Aschbacher classes, McLaughlin's theorem, Property (T), the congruence subgroup property) from models that rely on general-purpose computation. We report empirical results across five representative…
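To make the "surjection-at-prime" invariant concrete, here is a minimal Python sketch of the general-purpose route that the abstract contrasts with structural reasoning. It is not the paper's method or its instance generator: it reduces the generators mod p, enumerates the subgroup they generate in SL(3, F_p) by brute force, and compares its size with the standard order formula |SL(3, F_p)| = p^3 (p^2 - 1)(p^3 - 1). All function names and the sample generators are hypothetical, chosen only for illustration, and the enumeration is practical only for very small primes.

```python
# Illustrative sketch only: brute-force test of surjection onto SL(3, F_p).
# Function names and sample generators are assumptions, not from the paper.

IDENTITY = ((1, 0, 0), (0, 1, 0), (0, 0, 1))


def det3(m):
    """Determinant of a 3x3 integer matrix (used to check membership in SL(3, Z))."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


def reduce_mod(m, p):
    """Reduce every entry of a matrix modulo p."""
    return tuple(tuple(x % p for x in row) for row in m)


def mat_mul_mod(a, b, p):
    """Product of two 3x3 matrices with entries reduced modulo p."""
    return tuple(
        tuple(sum(a[i][k] * b[k][j] for k in range(3)) % p for j in range(3))
        for i in range(3)
    )


def sl3_order(p):
    """Order of SL(3, F_p) for a prime p."""
    return p**3 * (p**2 - 1) * (p**3 - 1)


def generated_order_mod_p(gens, p):
    """Size of the subgroup of SL(3, F_p) generated by the reductions of gens.

    In a finite group, closing under right multiplication by the generators
    already yields the whole generated subgroup, so inverses are not needed.
    """
    if any(det3(g) != 1 for g in gens):
        raise ValueError("every generator must have determinant 1")
    gens_p = [reduce_mod(g, p) for g in gens]
    seen = {IDENTITY}
    frontier = [IDENTITY]
    while frontier:
        next_frontier = []
        for x in frontier:
            for g in gens_p:
                y = mat_mul_mod(x, g, p)
                if y not in seen:
                    seen.add(y)
                    next_frontier.append(y)
        frontier = next_frontier
    return len(seen)


def surjects_at_prime(gens, p):
    """True if the generated subgroup maps onto all of SL(3, F_p)."""
    return generated_order_mod_p(gens, p) == sl3_order(p)


# Elementary matrices E12, E23, E31; their commutators give the remaining
# elementary matrices, so together they generate SL(3, Z).
E12 = ((1, 1, 0), (0, 1, 0), (0, 0, 1))
E23 = ((1, 0, 0), (0, 1, 1), (0, 0, 1))
E31 = ((1, 0, 0), (0, 1, 0), (1, 0, 1))

# Their squares are congruent to the identity mod 2, so surjection fails at p = 2.
E12_SQ = ((1, 2, 0), (0, 1, 0), (0, 0, 1))
E23_SQ = ((1, 0, 0), (0, 1, 2), (0, 0, 1))
E31_SQ = ((1, 0, 0), (0, 1, 0), (2, 0, 1))
```

For example, `surjects_at_prime([E12, E23, E31], 2)` holds, while `surjects_at_prime([E12_SQ, E23_SQ, E31_SQ], 2)` does not, because all three squared generators reduce to the identity mod 2. The enumeration visits every element of the image, so its cost grows like p^8; this is the kind of generic computation that a solver without the construction-time information would fall back on.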
