Compare without Despair: Reliable Preference Evaluation with Generation Separability

Sayan Ghosh; Tejas Srinivasan; Swabha Swayamdipta

arXiv:2407.01878 · cs.CL · October 30, 2024

TL;DR

This paper introduces a measure called separability to improve the consistency and reliability of preference evaluations of language models, especially in challenging scenarios with similar or variable outputs.

Contribution

The paper proposes a novel meta-evaluation measure, separability, to assess and enhance the reliability of preference judgments in language model evaluation.

Findings

01. High-separability instances yield more consistent ratings.

02. The distribution of separability provides insight into benchmark quality.

03. Incorporating separability improves LLM ranking accuracy.

Abstract

Human evaluation of generated language through pairwise preference judgments is pervasive. However, in common scenarios, such as when generations from a model pair are very similar or when stochastic decoding produces large variations in generations, this kind of evaluation yields inconsistent preference ratings. We address these challenges by introducing a meta-evaluation measure, separability, which estimates how suitable a test instance is for pairwise preference evaluation. For a candidate test instance, separability samples multiple generations from a pair of models and measures how distinguishable the two sets of generations are. Our experiments show that instances with high separability values yield more consistent preference ratings from both human- and auto-raters. Further, the distribution of separability offers insight into which test benchmarks are more valuable for comparing models.…
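
The idea in the abstract (sample several generations per model, then measure how distinguishable the two sets are) can be illustrated with a small Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the Jaccard token-overlap similarity, the helper names, and the choice to score an instance as within-model similarity minus cross-model similarity. The paper's actual definition and similarity metric may differ.

```python
from itertools import combinations, product


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; a stand-in for any text similarity metric."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def mean_within(gens: list[str]) -> float:
    """Average similarity between distinct generations of the same model."""
    pairs = list(combinations(gens, 2))
    if not pairs:
        raise ValueError("need at least two generations per model")
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def mean_across(gens_a: list[str], gens_b: list[str]) -> float:
    """Average similarity between generations of model A and model B."""
    pairs = list(product(gens_a, gens_b))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


def separability_sketch(gens_a: list[str], gens_b: list[str]) -> float:
    """Hypothetical score: high when each model is self-consistent
    but the two models' outputs differ from each other."""
    within = max(mean_within(gens_a), mean_within(gens_b))
    return within - mean_across(gens_a, gens_b)


if __name__ == "__main__":
    model_a = ["the cat sat on the mat", "the cat sat on a mat"]
    model_b = ["stock prices fell sharply", "stock prices fell today"]
    print(separability_sketch(model_a, model_b))
```

Under this sketch, an instance where both models produce near-identical outputs scores close to zero, which matches the abstract's point that such instances are poorly suited to pairwise preference evaluation.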

Code & Models

Repositories

dill-lab/separability (official)

Videos

Compare without Despair: Reliable Preference Evaluation with Generation Separability · underline

Taxonomy

Topics: Economic and Environmental Valuation · Multi-Criteria Decision Making · Decision-Making and Behavioral Economics