Cross-Model Disagreement as a Label-Free Correctness Signal
Matt Gorbett, Suman Jana

TL;DR
This paper introduces cross-model disagreement, a training-free, label-free signal for detecting when a language model is wrong. It targets confident errors in particular, where the model's own uncertainty gives no warning.
Contribution
The authors propose two cross-model disagreement metrics, CMP and CME, which they report outperform within-model uncertainty measures at identifying model errors, with no additional training.
Findings
CMP achieves 0.75 AUROC on MMLU, outperforming baselines.
CME and CMP outperform within-model uncertainty on multiple benchmarks.
The method is applicable to deployment monitoring and model oversight.
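For readers unfamiliar with the metric, AUROC here measures how often a disagreement score ranks a wrong answer above a correct one. The sketch below is an illustration only: the function name `auroc` and the toy scores are made up for this example and are not data or code from the paper. It uses the pairwise (Mann-Whitney) definition, which needs correctness labels only for evaluation, not for computing the score itself.

```python
def auroc(scores, is_error):
    """Probability that a randomly chosen wrong answer gets a higher
    score than a randomly chosen correct answer (ties count as half)."""
    wrong = [s for s, e in zip(scores, is_error) if e]
    right = [s for s, e in zip(scores, is_error) if not e]
    if not wrong or not right:
        raise ValueError("need at least one wrong and one correct answer")
    wins = 0.0
    for w in wrong:
        for r in right:
            if w > r:
                wins += 1.0
            elif w == r:
                wins += 0.5
    return wins / (len(wrong) * len(right))


# Toy, hand-made values (NOT results from the paper).
toy_scores = [0.9, 0.3, 0.6, 0.8, 0.2]          # higher = verifier more surprised
toy_is_error = [True, True, True, False, False]  # evaluation-only labels
print(auroc(toy_scores, toy_is_error))           # 4 of 6 pairs ordered correctly
```

An AUROC of 0.5 corresponds to a score that carries no information about correctness, and 1.0 to a score that separates wrong from correct answers perfectly.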
Abstract
Detecting when a language model is wrong without ground truth labels is a fundamental challenge for safe deployment. Existing approaches rely on a model's own uncertainty -- such as token entropy or confidence scores -- but these signals fail critically on the most dangerous failure mode: confident errors, where a model is wrong but certain. In this work we introduce cross-model disagreement as a correctness indicator -- a simple, training-free signal that can be dropped into existing production systems, routing pipelines, and deployment monitoring infrastructure without modification. Given a model's generated answer, cross-model disagreement computes how surprised or uncertain a second verifier model is when reading that answer via a single forward pass. No generation from the verifying model is required, and no correctness labels are needed. We instantiate this principle as…
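The idea of scoring an answer by how "surprised or uncertain" a verifier is can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract above is cut off before CMP and CME are defined, so the two quantities computed here (perplexity of the answer tokens and mean predictive entropy under the verifier) are generic stand-ins for "surprise" and "uncertainty". The function name `verifier_scores` and the toy logits are hypothetical. In practice the logits would come from one forward pass of the verifier over the prompt plus the generated answer.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def verifier_scores(verifier_logits, answer_token_ids):
    """Score an answer using the verifier's next-token distributions.

    verifier_logits[t] is assumed to be the verifier's logits at the
    position that predicts answer token t (one forward pass, no generation).
    Returns a surprise score (perplexity) and an uncertainty score
    (mean entropy); higher values suggest more disagreement.
    """
    if not answer_token_ids or len(verifier_logits) != len(answer_token_ids):
        raise ValueError("need one logit vector per answer token")
    nll_sum = 0.0
    entropy_sum = 0.0
    for logits, token_id in zip(verifier_logits, answer_token_ids):
        logp = log_softmax(logits)
        nll_sum += -logp[token_id]                             # surprise at the actual token
        entropy_sum += -sum(math.exp(lp) * lp for lp in logp)  # spread of the distribution
    n = len(answer_token_ids)
    return {"perplexity": math.exp(nll_sum / n), "mean_entropy": entropy_sum / n}


# Toy 4-token vocabulary; the verifier strongly expects token 0 at both steps.
toy_logits = [[10.0, 0.0, 0.0, 0.0], [10.0, 0.0, 0.0, 0.0]]
print(verifier_scores(toy_logits, [0, 0]))  # answer matches the verifier's expectation
print(verifier_scores(toy_logits, [1, 1]))  # answer contradicts it: much higher perplexity
```

Note that in the second call the verifier's entropy stays low while the perplexity is high, which is the situation within-model uncertainty cannot see: a confident answer that a second model finds implausible.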
Taxonomy
Topics: Software System Performance and Reliability · Adversarial Robustness in Machine Learning · Topic Modeling
