Models That Prove Their Own Correctness
Noga Amit, Shafi Goldwasser, Orr Paradise, Guy Rothblum

TL;DR
This paper introduces Self-Proving models that can verify the correctness of their outputs with high probability using an interactive proof system, providing stronger guarantees than traditional accuracy metrics.
Contribution
It proposes a theoretically-founded framework for training models that can prove their output correctness via an interactive verifier, a novel approach to model reliability.
Findings
Self-Proving models can reliably verify most of their outputs.
Two generic training methods for Self-Proving models are introduced: Transcript Learning and Reinforcement Learning from Verifier Feedback.
The approach guarantees that incorrect outputs are detected by the verifier, enhancing trustworthiness.
Abstract
How can we trust the correctness of a learned model on a particular input of interest? Model accuracy is typically measured on average over a distribution of inputs, giving no guarantee for any fixed input. This paper proposes a theoretically-founded solution to this problem: to train Self-Proving models that prove the correctness of their output to a verification algorithm via an Interactive Proof. Self-Proving models satisfy that, with high probability over an input sampled from a given distribution, the model generates a correct output and successfully proves its correctness to . The soundness property of guarantees that, for every input, no model can convince of the correctness of an incorrect output. Thus, a Self-Proving model proves correctness of most of its outputs, while all incorrect outputs (of any model) are detected by . We devise and analyze two generic…
Peer Reviews
Decision·Submitted to ICLR 2025
1. The idea of using a verifier from the theory of interactive proof systems for learning a self-proving model is very nice. It may lead to further interesting research activities that address the AI safety issue using several related tools from theoretical computer science, such as PCP and property testing etc. 2. The paper is written well. The discussion on related work helped me to understand what people had explored in the past, and to see the contributions of the paper more clearly. Also,
I support the acceptance of this paper. The following points are mostly minor. 1. Having an example in addition to GCD would have convinced me of the promise of the authors' approach far more. As the authors pointed out, the proofs in this GCD case do not involve questions from the verifier, and so they are simple. Also, the annotated transcript learning is only vaguely defined, and it is only explained in terms of illustration in the example via the intermediate steps of the Euclid algorithm.
I find the overall approach principled and welcome. Moving the assessment of the answer of a model from test results to an estimate of the correctness on the individual query (and this not being provided by the model itself) is certainly welcome. I am not sure the overall setup is novel as such (see below); however I believe the error bounds on the two learning approaches are. The results are well backed by theory and the paper does a good job at making this challenging subject as accessible a
It was not clear to me the extent to novelty of the overall setup of the pair prover/disprover. Obviously this is a well-known setup in many applications including those cited in the literature (perhaps the whole area of "Argumentation" in AI should also be included as it is not very far from here), but this particular instantiation with ML models and a formal verifier could be clarified. The verifier obviously has a fundamental role here. I might have missed this but the implications of this w
1. This is a well-motivated work. The use case shown in the introduction of the paper is attractive. 2. The proposed self-proving framework that proves the correctness of the answer by interactions between an autoregressive model and an external verifier seems a natural setting.
1. Theorem 4.1, one of the major theoretical contributions of the paper, makes some strong assumptions. Therefore, I think the contribution of the theorem is limited. Firstly, it assumes that $A(\theta)$ is concave. I think this assumption does not hold for the typical autoregressive models used today, including the GPT model used in experiments. Moreover, the theorem assumes the existence of $\theta^\ast$ satisfying $A(\theta^\ast) \geq 1 - \epsilon/2$. This is also a strong assumption since i
Code & Models
Videos
Models that prove their own correctness· youtube
Taxonomy
TopicsMachine Learning and Algorithms · Adversarial Robustness in Machine Learning · Topic Modeling
