Evaluating the Architectural Reasoning Capabilities of LLM Provers via the Obfuscated Natural Number Game

Lixing Li

arXiv:2605.00677·cs.LG·May 4, 2026

Evaluating the Architectural Reasoning Capabilities of LLM Provers via the Obfuscated Natural Number Game

Lixing Li

PDF

TL;DR

This paper introduces a benchmark to evaluate the genuine logical reasoning abilities of large language models in formal mathematics, focusing on their capacity to synthesize proofs in obfuscated environments.

Contribution

It presents the Obfuscated Natural Number Game as a novel zero-knowledge benchmark to measure architectural reasoning in LLMs, revealing robustness differences among models.

Findings

01

Reasoning models maintain accuracy despite obfuscation.

02

General models show increased inference time with obfuscation.

03

The benchmark provides a quantitative measure of mathematical reasoning capacity.

Abstract

While Large Language Models have achieved notable success on formal mathematics benchmarks such as MiniF2F, it remains unclear whether these results stem from genuine logical reasoning or semantic pattern matching against pre-training data. This paper identifies Architectural Reasoning: the ability to synthesize formal proofs using exclusively local axioms and definitions within an alien math domain, as the necessary ability for future automated theorem discovery AI. We use the Obfuscated Natural Number Game, a benchmark to evaluate Architectural Reasoning. By renaming identifiers in the Natural Number Game in Lean 4, we created a zero-knowledge, closed environment. We evaluate state-of-the-art models, finding a universal latency tax where obfuscation increases inference time. The results also reveal a divergence in robustness: while general models (Claude-Sonnet-4.5, GPT-4o) suffer…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.