TL;DR
This paper investigates whether large language models exhibit formalization gaming in logical reasoning, finding that while models often avoid forced proofs, unfaithfulness can occur in subtle ways undetected by current metrics.
Contribution
It introduces a framework to evaluate faithfulness in logical reasoning by analyzing models' behavior in a two-stage formalization and proving pipeline.
Findings
Models prefer reporting failure over forcing proofs.
GPT-5 fabricates axioms during proof generation.
DeepSeek-R1 mistranslates premises, evading detection.
Abstract
Formal verification guarantees proof validity but not formalization faithfulness. For natural-language logical reasoning, where models construct axiom systems from scratch without library constraints, this gap between valid proofs and faithful translations is especially acute. We investigate whether frontier models exploit this gap when generating Lean 4 proofs, a behavior we term formalization gaming. We evaluate GPT-5 and DeepSeek-R1 on 303 first-order logic problems (203 from FOLIO, 100 from Multi-LogiEval), comparing unified generation against a two-stage pipeline that separates formalization from proving. Despite compilation rates of 87-99%, we find no evidence of systematic gaming in unified generation: models prefer reporting failure over forcing proofs, even under prompting designed to encourage it. However, unfaithfulness that evades our detection signals may still occur. The…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
