E-Scores for (In)Correctness Assessment of Generative Model Outputs

Guneet S. Dhillon; Javier González; Teodora Pandeva; Alicia Curth

arXiv:2510.25770·stat.ML·April 2, 2026

TL;DR

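This paper introduces e-scores, measures of incorrectness built on e-values, for assessing generative model outputs. Unlike p-value-based conformal methods, whose guarantees can break when the tolerance level is chosen post hoc, e-scores allow data-dependent tolerance levels while upper bounding size distortion.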

Contribution

The paper proposes e-scores as an alternative to p-value-based scores for correctness assessment, enabling data-dependent tolerance levels with guaranteed error bounds.

Findings

01. E-scores effectively assess LLM correctness in factuality and property satisfaction.

02. They provide flexible, data-dependent error control with guaranteed upper bounds.

03. Experimental results show improved reliability over p-value-based methods.

Abstract

While generative models, especially large language models (LLMs), are ubiquitous in today's world, principled mechanisms to assess their (in)correctness are limited. Using the conformal prediction framework, previous works construct sets of LLM responses where the probability of including an incorrect response, or error, is capped at a user-defined tolerance level. However, since these methods are based on p-values, they are susceptible to p-hacking, i.e., choosing the tolerance level post-hoc can invalidate the guarantees. We therefore leverage e-values to complement generative model outputs with e-scores as measures of incorrectness. In addition to achieving the guarantees as before, e-scores further provide users with the flexibility of choosing data-dependent tolerance levels while upper bounding size distortion, a post-hoc notion of error. We experimentally demonstrate their…
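
The contrast between p-values and e-values under a post-hoc tolerance level can be illustrated with a small simulation. This is a minimal sketch of the general e-value idea, not the paper's e-score construction. It assumes a generic null-hypothesis setting, a hypothetical bounded e-value `e = 2 * (1 - p)` obtained by calibrating a uniform p-value, and one common reading of size distortion as the expected ratio of the error indicator to the chosen tolerance level; the paper's exact definition may differ.

```python
import random


def simulate(n=100_000, seed=0):
    """Estimate post-hoc size distortion E[error / alpha] under the null.

    Assumption: the user picks the smallest tolerance level alpha that
    still flags the sample, after seeing the data.
    """
    rng = random.Random(seed)
    p_ratio_sum = 0.0
    e_ratio_sum = 0.0
    for _ in range(n):
        # Under the null, a valid p-value is uniform on (0, 1].
        p = 1.0 - rng.random()

        # p-value user: choosing alpha = p always triggers an error,
        # so error / alpha = 1 / p, which has infinite expectation.
        p_ratio_sum += 1.0 / p

        # Hypothetical bounded e-value with E[e] = 1 under the null.
        e = 2.0 * (1.0 - p)

        # e-value user: flags when e >= 1 / alpha; the smallest such
        # alpha is 1 / e (only possible when e >= 1, since alpha <= 1).
        # Then error / alpha = e, and its mean is at most E[e] <= 1.
        if e >= 1.0:
            e_ratio_sum += e
    return p_ratio_sum / n, e_ratio_sum / n


p_distortion, e_distortion = simulate()
print(f"p-value post-hoc distortion: {p_distortion:.2f}")
print(f"e-value post-hoc distortion: {e_distortion:.2f}")
```

In this sketch the e-value ratio stays bounded because the worst-case ratio never exceeds the e-value itself, whose expectation is at most one, while the p-value ratio has no such cap.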
