Stratified Prediction-Powered Inference for Hybrid Language Model Evaluation

Adam Fisch; Joshua Maynez; R. Alex Hofer; Bhuwan Dhingra; Amir Globerson; William W. Cohen

arXiv:2406.04291 · cs.LG · December 5, 2024 · 2 cites

Open Access

TL;DR

This paper introduces StratPPI, a stratified variant of prediction-powered inference for language model evaluation. By stratifying the data, it yields tighter confidence intervals without making assumptions about the data distribution or the automatic labeling system.

Contribution

The paper proposes StratPPI, a stratification-based method that improves confidence interval estimation in prediction-powered inference for language models, with theoretical guarantees and empirical validation.

Findings

01 StratPPI yields considerably tighter confidence intervals than unstratified PPI.

02 The confidence intervals are provably valid without assumptions on the data distribution or the automatic labeling system.

03 Empirical results demonstrate improved evaluation accuracy for language models.

Abstract

Prediction-powered inference (PPI) is a method that improves statistical estimates based on limited human-labeled data. PPI achieves this by combining small amounts of human-labeled data with larger amounts of data labeled by a reasonably accurate -- but potentially biased -- automatic system, in a way that results in tighter confidence intervals for certain parameters of interest (e.g., the mean performance of a language model). In this paper, we propose a method called Stratified Prediction-Powered Inference (StratPPI), in which we show that the basic PPI estimates can be considerably improved by employing simple data stratification strategies. Without making any assumptions on the underlying automatic labeling system or data distribution, we derive an algorithm for computing provably valid confidence intervals for population parameters (such as averages) that is based on stratified…
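To make the idea in the abstract concrete, the following is a minimal sketch of a PPI-style mean estimate and a stratified variant. It is not the paper's algorithm: the function names (`ppi_mean`, `stratified_ppi_mean`, `confidence_interval`) are hypothetical, the stratum weights are assumed to be known population proportions, the interval is a plain normal approximation, and any tuning or sample-allocation steps the paper may use are left out.

```python
import numpy as np
from statistics import NormalDist


def ppi_mean(y_labeled, f_labeled, f_unlabeled):
    """PPI-style mean: autorater mean on unlabeled data plus a bias correction.

    y_labeled   : human labels on the small labeled set
    f_labeled   : autorater predictions on that same labeled set
    f_unlabeled : autorater predictions on the large unlabeled set
    Returns (estimate, estimated variance of the estimate).
    """
    y = np.asarray(y_labeled, dtype=float)
    fl = np.asarray(f_labeled, dtype=float)
    fu = np.asarray(f_unlabeled, dtype=float)
    rectifier = y - fl  # per-example autorater error, measured with human labels
    estimate = fu.mean() + rectifier.mean()
    variance = fu.var(ddof=1) / fu.size + rectifier.var(ddof=1) / rectifier.size
    return estimate, variance


def stratified_ppi_mean(strata, weights):
    """Combine per-stratum PPI estimates using assumed-known stratum weights.

    strata  : list of (y_labeled, f_labeled, f_unlabeled) tuples, one per stratum
    weights : population proportion of each stratum (should sum to 1)
    """
    estimate, variance = 0.0, 0.0
    for (y, fl, fu), w in zip(strata, weights):
        e, v = ppi_mean(y, fl, fu)
        estimate += w * e
        variance += w ** 2 * v  # strata are sampled independently
    return estimate, variance


def confidence_interval(estimate, variance, alpha=0.05):
    """Two-sided normal-approximation interval at level 1 - alpha."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * variance ** 0.5
    return estimate - half_width, estimate + half_width
```

In this sketch the rectifier term corrects the autorater's average bias using the human labels, which is why no accuracy assumption on the autorater is needed for the estimate to stay centered. Stratifying can help when autorater error varies across strata, since each stratum's correction is then estimated separately.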

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling