Compare without Despair: Reliable Preference Evaluation with Generation Separability
Sayan Ghosh, Tejas Srinivasan, Swabha Swayamdipta

TL;DR
This paper introduces a measure called separability to improve the consistency and reliability of preference evaluations of language models, especially in challenging scenarios with similar or variable outputs.
Contribution
The paper proposes a novel meta-evaluation measure, separability, to assess and enhance the reliability of preference judgments in language model evaluation.
Findings
High-separability instances yield more consistent preference ratings from both human and automatic raters.
Separability distribution provides insights into benchmark quality.
Incorporating separability improves LLM ranking accuracy.
Abstract
Human evaluation of generated language through pairwise preference judgments is pervasive. However, in common scenarios, such as when generations from a model pair are very similar or when stochastic decoding produces large variations in generations, such evaluation yields inconsistent preference ratings. We address these challenges by introducing a meta-evaluation measure, separability, which estimates how suitable a test instance is for pairwise preference evaluation. For a candidate test instance, separability samples multiple generations from a pair of models and measures how distinguishable the two sets of generations are. Our experiments show that instances with high separability values yield more consistent preference ratings from both human- and auto-raters. Further, the distribution of separability allows insights into which test benchmarks are more valuable for comparing models.…
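The abstract describes separability only at a high level: sample several generations from each of two models, then measure how distinguishable the two sets are. The following is a minimal sketch of that idea, not the authors' definition. It assumes a hypothetical score equal to the mean within-model similarity minus the mean cross-model similarity, and uses token-set Jaccard overlap as a stand-in similarity function. The names `jaccard` and `separability_sketch` are illustrative.

```python
from itertools import combinations, product
from statistics import mean


def jaccard(a: str, b: str) -> float:
    """Token-set Jaccard similarity; a stand-in for any text similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)


def separability_sketch(gens_a, gens_b, sim=jaccard):
    """Hypothetical score (an assumption, not the paper's formula):
    mean within-model similarity minus mean cross-model similarity.

    Near 0: the two sets of generations are hard to tell apart.
    Larger: each model's generations resemble each other more than
    they resemble the other model's generations.
    """
    if len(gens_a) < 2 or len(gens_b) < 2:
        raise ValueError("need at least two generations per model")
    # Similarity between pairs of generations from the same model.
    within = [sim(x, y) for g in (gens_a, gens_b) for x, y in combinations(g, 2)]
    # Similarity between pairs drawn from different models.
    cross = [sim(x, y) for x, y in product(gens_a, gens_b)]
    return mean(within) - mean(cross)
```

Under this sketch, a test instance on which both models produce near-identical outputs scores close to zero, which matches the abstract's point that such instances are poorly suited to pairwise preference evaluation.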
Taxonomy
Topics: Economic and Environmental Valuation · Multi-Criteria Decision Making · Decision-Making and Behavioral Economics
