Learning to Disprove: Formal Counterexample Generation with Large Language Models

Zenan Li; Zhaoyu Li; Kaiyu Yang; Xiaoxing Ma; Zhendong Su

arXiv:2603.19514·cs.AI·March 23, 2026

Learning to Disprove: Formal Counterexample Generation with Large Language Models

Zenan Li, Zhaoyu Li, Kaiyu Yang, Xiaoxing Ma, Zhendong Su

PDF

Open Access 5 Reviews

TL;DR

This paper introduces a method for training large language models to generate formal counterexamples in mathematics, complementing proof generation and improving AI reasoning capabilities.

Contribution

It presents a novel approach combining symbolic mutation and multi-reward training to enhance LLMs in formal counterexample generation and theorem proving.

Findings

01

Significant performance improvements on three benchmarks.

02

Effective data synthesis via symbolic mutation.

03

Enhanced training framework for counterexample generation.

Abstract

Mathematical reasoning demands two critical, complementary skills: constructing rigorous proofs for true statements and discovering counterexamples that disprove false ones. However, current AI efforts in mathematics focus almost exclusively on proof construction, often neglecting the equally important task of finding counterexamples. In this paper, we address this gap by fine-tuning large language models (LLMs) to reason about and generate counterexamples. We formalize this task as formal counterexample generation, which requires LLMs not only to propose candidate counterexamples but also to produce formal proofs that can be automatically verified in the Lean 4 theorem prover. To enable effective learning, we introduce a symbolic mutation strategy that synthesizes diverse training data by systematically extracting theorems and discarding selected hypotheses, thereby producing diverse…

Peer Reviews

Decision·ICLR 2026 Conference Withdrawn Submission

Reviewer 01Rating 2Confidence 5

Strengths

1. The formalization of counterexample generation as an informal-to-formal task in Lean 4 is a valuable direction for the community. 2. The Symbolic Mutation Strategy does successfully create a large volume (575K) of synthetically generated problems, which is a practical contribution.

Weaknesses

1. The unacceptably large standard deviation relative to the mean reported in the results (e.g., training curves and ablations) makes the empirical claims of superiority unreliable and unsubstantiated. This severe instability suggests the method is not robust and the reported mean performance is likely a statistical artifact of favorable seed initialization. Without robust, repeatable results, the paper cannot be accepted. 2. Both core techniques, Multi-Reward RL and Symbolic Mutation, lack fun

Reviewer 02Rating 6Confidence 2

Strengths

- Addresses an important, underexplored problem in mathematical reasoning - The hypothesis-dropping mutation strategy is principled and guarantees valid problems while maintaining reasonable diversity - Multi-reward framework elegantly handles the sparse reward problem that would otherwise limit training to easy problems - Strong empirical results with 47-74% relative improvements across multiple benchmarks

Weaknesses

- Lacks theoretical analysis of what makes the generated problems effective for training - why does this particular distribution of problems work? - Scalability concerns: mutation generates only ~1.78x problems per seed; training saturates after half an epoch suggesting significant redundancy; approach requires existing formal libraries - Model size limitations fundamentally constrain performance - case studies show 7-8B models make basic arithmetic errors and fail to follow instructions, but re

Reviewer 03Rating 6Confidence 3

Strengths

- **Novel Problem Focus**: Fills a crucial gap in AI math reasoning by centering formal counterexample generation, a practice rooted in the "positive hard, negative easy" heuristic (e.g., using counterexamples when proofs stall). - **Innovative Data Synthesis**: The symbolic mutation strategy creatively generates large-scale, high-quality counterexample data (575K problems) from diverse theorem sources, solving the data scarcity bottleneck.

Weaknesses

- **Informal-Formal Consistency Gaps**: The pipeline risks inconsistencies between informal natural language counterexamples and formal Lean 4 proofs—e.g., ambiguous natural language descriptions, implicit assumptions, or misaligned predicate definitions may fail formal verification. - **Synthesized Data Redundancy**: The paper notes redundant synthetic data causes early convergence (half an epoch) but lacks detailed analysis of *why* redundancy occurs (e.g., duplicate problem structures, overla

Reviewer 04Rating 6Confidence 3

Strengths

* **Originality of the task formalization:** The paper introduces a novel task to disproving mathematical statements using LLMs. The formalization with Lean 4–verifiable proofs is a meaningful. * **Methodological validity:** The symbolic mutation strategy and multi-reward expert-iteration framework are well-motivated and effectively address the challenges of sparse-reward training. * **Empirical significance:** The system's performance on three new benchmarks demonstrates its potential to adv

Weaknesses

* **Clarity**: Although I read the paper carefully, I found some parts hard to follow. For instance, the evaluation metrics are not specified, benchmark construction lacks detail and has typos, and hyperparameter settings are not clearly described. * **Evaluation gaps:** The inference system of the proposed method is two-model but other baselines are single-model. This makes it hard to attribute the performance gain solely to the proposed training method even with the ablation study. More analy

Reviewer 05Rating 2Confidence 4

Strengths

- Tackling disproof rather than proof generation is a valuable direction for formal-reasoning LLMs. - The “hypothesis-dropping” approach provides a practical way to generate large numbers of training problems automatically from existing Lean libraries. - Splitting the task into informal and formal stages aligns with ongoing efforts to model human-like reasoning pipelines (sketch → formalization).

Weaknesses

* **Fundamental logical mis-specification.** The central “counterexample problem” is formulated as an existential statement asserting $$ \exists x,\; (H_1(x)\wedge …\wedge H_{k-1}(x)) \wedge C(x), $$ rather than $$ \exists x,\; (H_1(x)\wedge …\wedge H_{k-1}(x)) \wedge \neg C(x). $$ As a result, the system *proves* an existential claim rather than *disproves* a universal one. The paper repeatedly describes these as “counterexample problems,” but they are semantically c

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsMathematics, Computing, and Information Processing · Logic, programming, and type systems · Machine Learning in Materials Science