Models That Prove Their Own Correctness

Noga Amit; Shafi Goldwasser; Orr Paradise; Guy Rothblum

arXiv:2405.15722·cs.LG·December 19, 2025

Models That Prove Their Own Correctness

Noga Amit, Shafi Goldwasser, Orr Paradise, Guy Rothblum

PDF

Open Access 1 Repo 1 Video 3 Reviews

TL;DR

This paper introduces Self-Proving models that can verify the correctness of their outputs with high probability using an interactive proof system, providing stronger guarantees than traditional accuracy metrics.

Contribution

It proposes a theoretically-founded framework for training models that can prove their output correctness via an interactive verifier, a novel approach to model reliability.

Findings

01

Self-Proving models can reliably verify most of their outputs.

02

Two generic training methods for Self-Proving models are introduced: Transcript Learning and Reinforcement Learning from Verifier Feedback.

03

The approach guarantees that incorrect outputs are detected by the verifier, enhancing trustworthiness.

Abstract

How can we trust the correctness of a learned model on a particular input of interest? Model accuracy is typically measured on average over a distribution of inputs, giving no guarantee for any fixed input. This paper proposes a theoretically-founded solution to this problem: to train Self-Proving models that prove the correctness of their output to a verification algorithm $V$ via an Interactive Proof. Self-Proving models satisfy that, with high probability over an input sampled from a given distribution, the model generates a correct output and successfully proves its correctness to $V$ . The soundness property of $V$ guarantees that, for every input, no model can convince $V$ of the correctness of an incorrect output. Thus, a Self-Proving model proves correctness of most of its outputs, while all incorrect outputs (of any model) are detected by $V$ . We devise and analyze two generic…

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01Rating 8Confidence 4

Strengths

1. The idea of using a verifier from the theory of interactive proof systems for learning a self-proving model is very nice. It may lead to further interesting research activities that address the AI safety issue using several related tools from theoretical computer science, such as PCP and property testing etc. 2. The paper is written well. The discussion on related work helped me to understand what people had explored in the past, and to see the contributions of the paper more clearly. Also,

Weaknesses

I support the acceptance of this paper. The following points are mostly minor. 1. Having an example in addition to GCD would have convinced me of the promise of the authors' approach far more. As the authors pointed out, the proofs in this GCD case do not involve questions from the verifier, and so they are simple. Also, the annotated transcript learning is only vaguely defined, and it is only explained in terms of illustration in the example via the intermediate steps of the Euclid algorithm.

Reviewer 02Rating 6Confidence 3

Strengths

I find the overall approach principled and welcome. Moving the assessment of the answer of a model from test results to an estimate of the correctness on the individual query (and this not being provided by the model itself) is certainly welcome. I am not sure the overall setup is novel as such (see below); however I believe the error bounds on the two learning approaches are. The results are well backed by theory and the paper does a good job at making this challenging subject as accessible a

Weaknesses

It was not clear to me the extent to novelty of the overall setup of the pair prover/disprover. Obviously this is a well-known setup in many applications including those cited in the literature (perhaps the whole area of "Argumentation" in AI should also be included as it is not very far from here), but this particular instantiation with ML models and a formal verifier could be clarified. The verifier obviously has a fundamental role here. I might have missed this but the implications of this w

Reviewer 03Rating 3Confidence 3

Strengths

1. This is a well-motivated work. The use case shown in the introduction of the paper is attractive. 2. The proposed self-proving framework that proves the correctness of the answer by interactions between an autoregressive model and an external verifier seems a natural setting.

Weaknesses

1. Theorem 4.1, one of the major theoretical contributions of the paper, makes some strong assumptions. Therefore, I think the contribution of the theorem is limited. Firstly, it assumes that $A(\theta)$ is concave. I think this assumption does not hold for the typical autoregressive models used today, including the GPT model used in experiments. Moreover, the theorem assumes the existence of $\theta^\ast$ satisfying $A(\theta^\ast) \geq 1 - \epsilon/2$. This is also a strong assumption since i

Code & Models

Repositories

orrp/self-proving-models
pytorchOfficial

Videos

Models that prove their own correctness· youtube

Taxonomy

TopicsMachine Learning and Algorithms · Adversarial Robustness in Machine Learning · Topic Modeling