Estimating LLM Consistency: A User Baseline vs Surrogate Metrics

Xiaoyuan Wu, Weiran Lin, Omer Akgul, and Lujo Bauer

arXiv:2505.23799 · cs.CL · November 25, 2025

TL;DR

This paper evaluates how well automated metrics for LLM consistency align with human perceptions, finding current methods often fall short and advocating for more human-involved evaluation approaches.

Contribution

The paper introduces a logit-based ensemble method for estimating LLM consistency and demonstrates its comparable performance to existing metrics in reflecting human judgments.

Findings

01. Current automated metrics poorly align with human perceptions.

02. The proposed logit-based ensemble method matches the best existing metrics.

03. Human evaluation remains crucial for accurate LLM consistency assessment.

Abstract

Large language models (LLMs) are prone to hallucinations and sensitive to prompt perturbations, often resulting in inconsistent or unreliable generated text. Different methods have been proposed to mitigate such hallucinations and fragility, one of which is to measure the consistency of LLM responses -- the model's confidence in the response or likelihood of generating a similar response when resampled. In previous work, measuring LLM response consistency often relied on calculating the probability of a response appearing within a pool of resampled responses, analyzing internal states, or evaluating logits of responses. However, it was not clear how well these approaches approximated users' perceptions of consistency of LLM responses. To find out, we performed a user study ($n = 2,976$) demonstrating that current methods for measuring LLM response consistency typically do not align well…
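Two of the surrogate approaches named in the abstract, resampling-based and logit-based estimation, can be illustrated with a minimal sketch. This is not the paper's method: the function names, the exact-match-after-normalization criterion, and the use of a geometric mean of token probabilities are all assumptions chosen for illustration. Practical implementations often compare responses by semantic similarity rather than string equality.

```python
import math
import re
from typing import Sequence


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = re.sub(r"[^\w\s]", "", text.lower())
    return " ".join(text.split())


def resample_consistency(response: str, resampled: Sequence[str]) -> float:
    """Fraction of resampled responses equal to `response` after normalization.

    Assumption: string equality stands in for "similar response".
    """
    if not resampled:
        raise ValueError("need at least one resampled response")
    target = normalize(response)
    matches = sum(normalize(r) == target for r in resampled)
    return matches / len(resampled)


def mean_token_probability(token_logprobs: Sequence[float]) -> float:
    """Geometric mean of token probabilities (exp of the mean log-probability).

    Assumption: per-token log-probabilities are already available from the model.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


# Hypothetical pool of responses resampled for the same prompt.
pool = ["Paris", "paris.", "Lyon", "Paris", "Marseille"]
print(resample_consistency("Paris", pool))
print(mean_token_probability([math.log(0.5), math.log(0.5)]))
```

Both functions return a score in [0, 1]. Whether scores like these track what users perceive as consistent is the question the user study addresses.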

Taxonomy

Topics: Semantic Web and Ontologies