TL;DR
This paper presents a self-play framework using formal verification for semantic equivalence in Haskell, improving code reasoning in language models with a new dataset and empirical validation.
Contribution
It introduces a novel self-play training method guided by formal proofs, along with a large synthetic dataset and empirical results demonstrating improved downstream task performance.
Findings
Evaluator achieves up to 13.3pp accuracy gain on EquiBench.
Equivalence proofs enhance reasoning capabilities more than inequivalence supervision.
The training pipeline and dataset are publicly available.
Abstract
We introduce a self-play framework for semantic equivalence in Haskell, utilizing formal verification to guide adversarial training between a generator and an evaluator. The framework leverages Liquid Haskell proofs for validating equivalence and execution-based counterexamples for inequivalence, organized via a difficulty-aware curriculum. To facilitate this, we release \textbf{OpInstruct-HSx}, a synthetic dataset of 28k validated Haskell programs. Empirical experiments show that our evaluator transfers effectively to downstream tasks, achieving up to 13.3pp accuracy gain on EquiBench and consistent gains on PySecDB. Ablation studies on the SEQ-SINQ regimes indicate that while inequivalence supervision provides data volume, equivalence proofs are uniquely responsible for the model's reasoning capabilities. The entire training pipeline and dataset are publicly released on…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
