E-Scores for (In)Correctness Assessment of Generative Model Outputs
Guneet S. Dhillon, Javier González, Teodora Pandeva, Alicia Curth

TL;DR
This paper introduces e-scores, measures of incorrectness based on e-values, for assessing generative model outputs. Unlike p-value-based methods, whose guarantees can be invalidated when the tolerance level is chosen post-hoc, e-scores allow data-dependent tolerance levels while still bounding error.
Contribution
It proposes e-scores as an alternative to p-values for correctness assessment, enabling data-dependent tolerance levels with guaranteed error bounds.
Findings
E-scores effectively assess LLM correctness in factuality and property satisfaction.
They let users choose tolerance levels in a data-dependent way while upper bounding size distortion, a post-hoc notion of error.
Experimental results show improved reliability over p-value-based methods.
Abstract
While generative models, especially large language models (LLMs), are ubiquitous in today's world, principled mechanisms to assess their (in)correctness are limited. Using the conformal prediction framework, previous works construct sets of LLM responses where the probability of including an incorrect response, or error, is capped at a user-defined tolerance level. However, since these methods are based on p-values, they are susceptible to p-hacking, i.e., choosing the tolerance level post-hoc can invalidate the guarantees. We therefore leverage e-values to complement generative model outputs with e-scores as measures of incorrectness. In addition to achieving the guarantees as before, e-scores further provide users with the flexibility of choosing data-dependent tolerance levels while upper bounding size distortion, a post-hoc notion of error. We experimentally demonstrate their…
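The contrast between fixed and post-hoc tolerance levels can be made concrete with a toy calculation. The sketch below is not the paper's construction: the null distribution, the function names, and the specific post-hoc error term are all illustrative assumptions. It relies only on the standard definition of an e-value (a non-negative variable with expectation at most 1 under the null) and Markov's inequality, and computes everything exactly with fractions.

```python
from fractions import Fraction

# Hypothetical null distribution of an e-value as (value, probability) pairs.
# Illustrative numbers only; they are not taken from the paper.
NULL_DIST = [
    (Fraction(0), Fraction(4, 10)),
    (Fraction(1, 2), Fraction(3, 10)),
    (Fraction(2), Fraction(2, 10)),
    (Fraction(4), Fraction(1, 10)),
]


def expectation(f):
    """Exact expectation of f(E) under NULL_DIST."""
    return sum(p * f(e) for e, p in NULL_DIST)


def error_prob(alpha):
    """P(E >= 1/alpha) under the null, for a tolerance level fixed in advance."""
    return expectation(lambda e: 1 if e >= 1 / alpha else 0)


def post_hoc_term(e):
    """Error indicator divided by a data-dependent level alpha_hat = 1/e.

    alpha_hat is the smallest level at which e still clears the 1/alpha
    threshold. Treating the expectation of this ratio as the post-hoc error
    notion is an assumption made for illustration.
    """
    if e == 0:
        return Fraction(0)  # no level is cleared, so no error is counted
    alpha_hat = 1 / e
    return Fraction(int(e >= 1 / alpha_hat)) / alpha_hat


mean_e = expectation(lambda e: e)

# Valid e-value: expectation under the null is at most 1.
print("E[E] =", mean_e)

# Fixed levels: Markov's inequality gives P(E >= 1/alpha) <= alpha.
for alpha in (Fraction(1, 2), Fraction(1, 4), Fraction(1, 10)):
    print("alpha =", alpha, "error prob =", error_prob(alpha))

# Post-hoc level: the expected ratio stays bounded by E[E] <= 1.
print("post-hoc expected ratio =", expectation(post_hoc_term))
```

In this toy setting the error probability stays below every fixed level, and the expected ratio under the most aggressive data-dependent level equals E[E], so it is still at most 1. A p-value-based threshold offers no analogous bound once the level is picked after seeing the data.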
