PRECISE: Reducing the Bias of LLM Evaluations Using Prediction-Powered Ranking Estimation
Abhishek Divekar, Anirban Majumder

TL;DR
PRECISE introduces a statistical framework that combines minimal human annotations with LLM judgments to accurately estimate search and ranking metrics, significantly reducing annotation effort and correcting LLM bias.
Contribution
It extends Prediction-Powered Inference to incorporate sub-instance annotations, enabling reliable metric estimation with fewer annotations and lower computational complexity.
Findings
Requires as few as 100 human-annotated queries and 10,000 unlabeled examples.
Effectively corrects LLM bias in low-resource settings.
Reduces variance of Precision@K estimates across datasets.
Abstract
Evaluating the quality of search, ranking and RAG systems traditionally requires a significant number of human relevance annotations. Recently, several deployed systems have explored using Large Language Models (LLMs) as automated judges for this task, but their inherent biases prevent direct use for metric estimation. We present a statistical framework extending Prediction-Powered Inference (PPI) that combines minimal human annotations with LLM judgments to produce reliable estimates of metrics which require sub-instance annotations. Our method requires as few as 100 human-annotated queries and 10,000 unlabeled examples, significantly reducing annotation requirements compared to traditional approaches. We formulate our proposed framework (PRECISE) for inference of relevance uplift for an LLM-based query reformulation application, extending PPI to sub-instance annotations…
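To make the idea of combining a small human-labeled set with many LLM judgments concrete, here is a minimal sketch of the basic Prediction-Powered Inference point estimate for a mean. It is not the PRECISE estimator itself, which the abstract says extends PPI to sub-instance annotations. The function name `ppi_mean` and the toy labels are assumptions made for illustration only.

```python
from statistics import mean


def ppi_mean(human_labels, llm_on_labeled, llm_on_unlabeled):
    """Basic PPI point estimate of a mean (hypothetical helper, not from the paper).

    human_labels     : human relevance labels on the small labeled set
    llm_on_labeled   : LLM judgments on the same labeled items
    llm_on_unlabeled : LLM judgments on the large unlabeled set
    """
    if len(human_labels) != len(llm_on_labeled):
        raise ValueError("labeled sets must be paired item by item")
    # Rectifier: average gap between human labels and LLM judgments,
    # measured where both are available. It estimates the LLM's bias.
    rectifier = mean(h - p for h, p in zip(human_labels, llm_on_labeled))
    # LLM-only estimate on the large set, shifted by the estimated bias.
    return mean(llm_on_unlabeled) + rectifier


# Toy data (made up): the LLM judge marks one irrelevant item as relevant.
human = [1, 0, 1, 1]
llm_labeled = [1, 1, 1, 1]
llm_unlabeled = [1, 1, 1, 0]

estimate = ppi_mean(human, llm_labeled, llm_unlabeled)
print(estimate)
```

In this toy case the LLM-only average on the unlabeled set is 0.75, and the rectifier of -0.25 pulls the estimate down to 0.5. The correction term is what lets a small number of human labels offset a systematic bias in the LLM judge.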
Taxonomy
Topics: Information Retrieval and Search Behavior · Topic Modeling · Natural Language Processing Techniques
