Evaluating the Architectural Reasoning Capabilities of LLM Provers via the Obfuscated Natural Number Game
Lixing Li

TL;DR
This paper introduces a benchmark to evaluate the genuine logical reasoning abilities of large language models in formal mathematics, focusing on their capacity to synthesize proofs in obfuscated environments.
Contribution
It presents the Obfuscated Natural Number Game as a novel zero-knowledge benchmark to measure architectural reasoning in LLMs, revealing robustness differences among models.
Findings
Reasoning models maintain accuracy despite obfuscation.
General models show increased inference time with obfuscation.
The benchmark provides a quantitative measure of mathematical reasoning capacity.
Abstract
While Large Language Models have achieved notable success on formal mathematics benchmarks such as MiniF2F, it remains unclear whether these results stem from genuine logical reasoning or semantic pattern matching against pre-training data. This paper identifies Architectural Reasoning: the ability to synthesize formal proofs using exclusively local axioms and definitions within an alien math domain, as the necessary ability for future automated theorem discovery AI. We use the Obfuscated Natural Number Game, a benchmark to evaluate Architectural Reasoning. By renaming identifiers in the Natural Number Game in Lean 4, we created a zero-knowledge, closed environment. We evaluate state-of-the-art models, finding a universal latency tax where obfuscation increases inference time. The results also reveal a divergence in robustness: while general models (Claude-Sonnet-4.5, GPT-4o) suffer…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
