Estimating LLM Consistency: A User Baseline vs Surrogate Metrics
Xiaoyuan Wu, Weiran Lin, Omer Akgul, and Lujo Bauer

TL;DR
This paper evaluates how well automated metrics for LLM consistency align with human perceptions, finding current methods often fall short and advocating for more human-involved evaluation approaches.
Contribution
The paper introduces a logit-based ensemble method for estimating LLM consistency and shows that it performs comparably to existing metrics in reflecting human judgments.
Findings
Current automated metrics poorly align with human perceptions.
The proposed logit-based ensemble method matches the best existing metrics.
Human evaluation remains crucial for accurate LLM consistency assessment.
Abstract
Large language models (LLMs) are prone to hallucinations and sensitive to prompt perturbations, often resulting in inconsistent or unreliable generated text. Different methods have been proposed to mitigate such hallucinations and fragility, one of which is to measure the consistency of LLM responses -- the model's confidence in the response or likelihood of generating a similar response when resampled. In previous work, measuring LLM response consistency often relied on calculating the probability of a response appearing within a pool of resampled responses, analyzing internal states, or evaluating logits of responses. However, it was not clear how well these approaches approximated users' perceptions of consistency of LLM responses. To find out, we performed a user study demonstrating that current methods for measuring LLM response consistency typically do not align well…
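To make the resampling-based notion of consistency mentioned in the abstract concrete, here is a minimal sketch. It is not the paper's method: the function names, the toy response pool, and the use of normalized exact matching to decide whether two responses are "the same" are all assumptions made for illustration. Real metrics typically use a semantic similarity measure in place of exact matching.

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace (a crude, assumed stand-in for semantic matching)."""
    return " ".join(text.lower().split())


def resampling_consistency(response: str, resampled: list[str]) -> float:
    """Return the fraction of resampled responses that match `response` after normalization."""
    if not resampled:
        raise ValueError("need at least one resampled response")
    target = normalize(response)
    matches = sum(1 for r in resampled if normalize(r) == target)
    return matches / len(resampled)


# Hypothetical pool of responses obtained by re-asking the same prompt.
pool = ["Paris", "paris", "Lyon", " Paris ", "Marseille"]
score = resampling_consistency("Paris", pool)
print(f"consistency estimate: {score:.2f}")
```

A score near 1 means the model almost always reproduces the response when resampled; a score near 0 means the response is rarely regenerated. Whether such a score tracks what users perceive as consistent is the question the paper's user study examines.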
Taxonomy
Topics: Semantic Web and Ontologies
