Truthfulness Despite Weak Supervision: Evaluating and Training LLMs Using Peer Prediction
Tianyi Alex Qiu, Micah Carroll, Cameron Allen

TL;DR
This paper introduces a peer prediction method for evaluating and training large language models using weak supervision, effectively promoting honesty and resisting deception without relying on ground truth labels, with theoretical and empirical validation.
Contribution
It applies game-theoretic peer prediction to LLM evaluation and training, providing a novel approach that improves truthfulness and robustness against deception in weak supervision scenarios.
Findings
Peer prediction enhances truthfulness in LLM training.
The method is effective even with minimal supervision signals.
Resistance to deception increases with larger size gaps between models.
Abstract
The evaluation and post-training of large language models (LLMs) rely on supervision, but strong supervision for difficult tasks is often unavailable, especially when evaluating frontier models. In such cases, models are demonstrated to exploit evaluations built on such imperfect supervision, leading to deceptive results. However, underutilized in LLM research, a wealth of mechanism design research focuses on game-theoretic incentive compatibility, i.e., eliciting honest and informative answers with weak supervision. Drawing from this literature, we introduce the peer prediction method for model evaluation and post-training. It rewards honest and informative answers over deceptive and uninformative ones, using a metric based on mutual predictability and without requiring ground truth labels. We demonstrate the method's effectiveness and resistance to deception, with both theoretical…
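To make "a metric based on mutual predictability" concrete, here is a toy sketch in Python. It is not the paper's algorithm. It assumes a simplified setting in which an answer is scored by how much it improves a log-probability prediction of a peer's answer. The function name `expected_score` and both joint distributions are hypothetical, chosen only for illustration.

```python
import math
from collections import defaultdict


def expected_score(joint):
    """Expected log-score gain E[log P(peer | report) - log P(peer)].

    `joint` maps (report, peer_answer) -> probability (an illustrative
    assumption, not data from the paper). The expectation equals the
    mutual information between the report and the peer's answer, in nats.
    """
    p_report = defaultdict(float)
    p_peer = defaultdict(float)
    for (report, peer), p in joint.items():
        p_report[report] += p
        p_peer[peer] += p

    total = 0.0
    for (report, peer), p in joint.items():
        if p > 0:
            # Gain from conditioning on the report versus using the prior.
            gain = math.log(p / p_report[report]) - math.log(p_peer[peer])
            total += p * gain
    return total


# Hypothetical informative reporter: agrees with the peer 80% of the time.
informative = {
    ("yes", "yes"): 0.4, ("yes", "no"): 0.1,
    ("no", "yes"): 0.1, ("no", "no"): 0.4,
}

# Hypothetical uninformative reporter: always says "yes".
uninformative = {("yes", "yes"): 0.5, ("yes", "no"): 0.5}

informative_score = expected_score(informative)
uninformative_score = expected_score(uninformative)
print(f"informative:   {informative_score:.4f} nats")
print(f"uninformative: {uninformative_score:.4f} nats")
```

In this toy setting the constant answer earns an expected score of zero because it carries no information about the peer's answer, while the correlated answer earns a positive score. The sketch illustrates only the informative-versus-uninformative distinction; it does not model deception or the paper's incentive-compatibility results.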
Peer Reviews
Decision: ICLR 2026 Poster
1. The paper is well-written, and the core concepts and claims are explained clearly and supported with figures. The problem is well-motivated. I especially appreciate the FAQ section.
2. Scalable oversight, especially weak-to-strong oversight and generalization, is an important problem. The authors' proposal to use ideas in mechanism design and game theory to achieve some degree of weak-to-strong oversight in deception mitigation seems novel.
3. The main method is backed by a solid game-theoret
1. The current mechanism does not adequately address collusion. Since the method is being pitched as an ad-hoc fix to the deceptive behaviors of existing models, collusion among the game participants seems likely.
2. Adding participants could bring about a quadratic increase in the query costs to LLMs.
- Experiments include a wide range of models spanning 135M to 405B parameters and 37K questions from 8 different datasets.
- Section 3 provides theoretical properties about the proposed method: truthfulness is a Bayesian Nash equilibrium, and approximate incentive compatibility can be achieved with a large pool of agents with representative priors.
- Algorithm 1's computational cost scales quadratically with the number of agents, which can be impractical when trying to have a large enough agent pool to achieve approximate incentive compatibility.
- Since the main focus is on incentive compatibility, I would've liked to see a more significant discussion of collusion. While collusion is briefly touched upon in the appendix, I would like to see a more in-depth explanation in the main paper.
The paper is well written and easy to follow. The proposed framework is novel and effective, supported by both theoretical and empirical results. The peer prediction method is incentive compatible and resistant to deception and strategic manipulation. The framework also demonstrates that the peer prediction method is resilient to deception. It also supports recovery of truthfulness, i.e., the accuracy drop caused by deception training is recovered.
The cost of the proposed framework may be a concern and needs further explanation. Algorithm 1 requires n^2 m rounds of iteration, which may amount to a large computational cost.
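The scaling concern can be illustrated with a short Python calculation. This is a sketch under assumptions: the review does not define its symbols, so n is taken here to be the number of agents and m the number of questions, and the helper name `total_rounds` is hypothetical.

```python
def total_rounds(n_agents, n_questions):
    """Round count n^2 * m quoted by the reviewer.

    Assumes n = number of agents and m = number of questions; the review
    itself does not define these symbols.
    """
    return n_agents ** 2 * n_questions


# Doubling the agent pool quadruples the round count for a fixed m.
for n in (2, 4, 8, 16):
    print(f"n={n:>2}, m=100 -> {total_rounds(n, 100)} rounds")
```

Under these assumptions, growing the pool from 4 to 8 agents at 100 questions raises the count from 1,600 to 6,400 rounds, which is the quadratic growth the reviewers point to.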
Taxonomy
Topics: Topic Modeling · Ethics and Social Impacts of AI · Explainable Artificial Intelligence (XAI)
