Learning to Disprove: Formal Counterexample Generation with Large Language Models
Zenan Li, Zhaoyu Li, Kaiyu Yang, Xiaoxing Ma, Zhendong Su

TL;DR
This paper introduces a method for training large language models to generate formal counterexamples in mathematics, complementing proof generation and improving AI reasoning capabilities.
Contribution
It presents a novel approach combining symbolic mutation and multi-reward training to enhance LLMs in formal counterexample generation and theorem proving.
Findings
Significant performance improvements on three benchmarks.
Effective data synthesis via symbolic mutation.
Enhanced training framework for counterexample generation.
Abstract
Mathematical reasoning demands two critical, complementary skills: constructing rigorous proofs for true statements and discovering counterexamples that disprove false ones. However, current AI efforts in mathematics focus almost exclusively on proof construction, often neglecting the equally important task of finding counterexamples. In this paper, we address this gap by fine-tuning large language models (LLMs) to reason about and generate counterexamples. We formalize this task as formal counterexample generation, which requires LLMs not only to propose candidate counterexamples but also to produce formal proofs that can be automatically verified in the Lean 4 theorem prover. To enable effective learning, we introduce a symbolic mutation strategy that synthesizes diverse training data by systematically extracting theorems and discarding selected hypotheses, thereby producing diverse…
Peer Reviews
Decision·ICLR 2026 Conference Withdrawn Submission
1. The formalization of counterexample generation as an informal-to-formal task in Lean 4 is a valuable direction for the community. 2. The Symbolic Mutation Strategy does successfully create a large volume (575K) of synthetically generated problems, which is a practical contribution.
1. The unacceptably large standard deviation relative to the mean reported in the results (e.g., training curves and ablations) makes the empirical claims of superiority unreliable and unsubstantiated. This severe instability suggests the method is not robust and the reported mean performance is likely a statistical artifact of favorable seed initialization. Without robust, repeatable results, the paper cannot be accepted. 2. Both core techniques, Multi-Reward RL and Symbolic Mutation, lack fun
- Addresses an important, underexplored problem in mathematical reasoning - The hypothesis-dropping mutation strategy is principled and guarantees valid problems while maintaining reasonable diversity - Multi-reward framework elegantly handles the sparse reward problem that would otherwise limit training to easy problems - Strong empirical results with 47-74% relative improvements across multiple benchmarks
- Lacks theoretical analysis of what makes the generated problems effective for training - why does this particular distribution of problems work? - Scalability concerns: mutation generates only ~1.78x problems per seed; training saturates after half an epoch suggesting significant redundancy; approach requires existing formal libraries - Model size limitations fundamentally constrain performance - case studies show 7-8B models make basic arithmetic errors and fail to follow instructions, but re
- **Novel Problem Focus**: Fills a crucial gap in AI math reasoning by centering formal counterexample generation, a practice rooted in the "positive hard, negative easy" heuristic (e.g., using counterexamples when proofs stall). - **Innovative Data Synthesis**: The symbolic mutation strategy creatively generates large-scale, high-quality counterexample data (575K problems) from diverse theorem sources, solving the data scarcity bottleneck.
- **Informal-Formal Consistency Gaps**: The pipeline risks inconsistencies between informal natural language counterexamples and formal Lean 4 proofs—e.g., ambiguous natural language descriptions, implicit assumptions, or misaligned predicate definitions may fail formal verification. - **Synthesized Data Redundancy**: The paper notes redundant synthetic data causes early convergence (half an epoch) but lacks detailed analysis of *why* redundancy occurs (e.g., duplicate problem structures, overla
* **Originality of the task formalization:** The paper introduces a novel task to disproving mathematical statements using LLMs. The formalization with Lean 4–verifiable proofs is a meaningful. * **Methodological validity:** The symbolic mutation strategy and multi-reward expert-iteration framework are well-motivated and effectively address the challenges of sparse-reward training. * **Empirical significance:** The system's performance on three new benchmarks demonstrates its potential to adv
* **Clarity**: Although I read the paper carefully, I found some parts hard to follow. For instance, the evaluation metrics are not specified, benchmark construction lacks detail and has typos, and hyperparameter settings are not clearly described. * **Evaluation gaps:** The inference system of the proposed method is two-model but other baselines are single-model. This makes it hard to attribute the performance gain solely to the proposed training method even with the ablation study. More analy
- Tackling disproof rather than proof generation is a valuable direction for formal-reasoning LLMs. - The “hypothesis-dropping” approach provides a practical way to generate large numbers of training problems automatically from existing Lean libraries. - Splitting the task into informal and formal stages aligns with ongoing efforts to model human-like reasoning pipelines (sketch → formalization).
* **Fundamental logical mis-specification.** The central “counterexample problem” is formulated as an existential statement asserting $$ \exists x,\; (H_1(x)\wedge …\wedge H_{k-1}(x)) \wedge C(x), $$ rather than $$ \exists x,\; (H_1(x)\wedge …\wedge H_{k-1}(x)) \wedge \neg C(x). $$ As a result, the system *proves* an existential claim rather than *disproves* a universal one. The paper repeatedly describes these as “counterexample problems,” but they are semantically c
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMathematics, Computing, and Information Processing · Logic, programming, and type systems · Machine Learning in Materials Science
