PRECISE: Reducing the Bias of LLM Evaluations Using Prediction-Powered Ranking Estimation

Abhishek Divekar; Anirban Majumder

arXiv:2601.18777·cs.LG·January 27, 2026

TL;DR

PRECISE introduces a statistical framework that combines minimal human annotations with LLM judgments to accurately estimate search and ranking metrics, significantly reducing annotation effort and correcting LLM bias.

Contribution

It extends Prediction-Powered Inference to incorporate sub-instance annotations, enabling reliable metric estimation with fewer annotations and lower computational complexity.

Findings

01

Requires as few as 100 human-annotated queries and 10,000 unlabeled examples.

02

Effectively corrects LLM bias in low-resource settings.

03

Reduces variance of Precision@K estimates across datasets.

Abstract

Evaluating the quality of search, ranking, and RAG systems traditionally requires a large number of human relevance annotations. Recently, several deployed systems have explored using Large Language Models (LLMs) as automated judges for this task, but their inherent biases prevent direct use for metric estimation. We present a statistical framework extending Prediction-Powered Inference (PPI) that combines minimal human annotations with LLM judgments to produce reliable estimates of metrics that require sub-instance annotations. Our method requires as few as 100 human-annotated queries and 10,000 unlabeled examples, significantly reducing annotation requirements compared to traditional approaches. We formulate our proposed framework (PRECISE) for inference of relevance uplift for an LLM-based query reformulation application, extending PPI to sub-instance annotations…
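To make the idea of combining a small human-labeled set with a large LLM-judged pool concrete, here is a minimal sketch of the standard PPI point estimate for a mean, applied to per-query Precision@K. It is not the PRECISE estimator itself, whose sub-instance extension is not spelled out in this summary. The function names, the synthetic data, and the simulated judge (which marks irrelevant results as relevant 30% of the time) are all assumptions made for illustration, as is treating each unlabeled example as one query.

```python
import random
from statistics import mean


def precision_at_k(labels, k):
    """Fraction of the top-k results judged relevant (labels are 0/1, in rank order)."""
    return sum(labels[:k]) / k


def ppi_mean(human_metric, judge_metric_labeled, judge_metric_unlabeled):
    """Standard PPI point estimate of a mean.

    Judge-only mean on the unlabeled pool, corrected by the average
    (human - judge) gap measured on the small human-labeled set.
    """
    rectifier = mean(h - j for h, j in zip(human_metric, judge_metric_labeled))
    return mean(judge_metric_unlabeled) + rectifier


# Synthetic illustration (hypothetical data, not from the paper).
rng = random.Random(0)
K = 10


def simulate_query():
    # True relevance of each of the K results for one query.
    truth = [1 if rng.random() < 0.5 else 0 for _ in range(K)]
    # Hypothetical biased judge: keeps every relevant label, and
    # flips an irrelevant result to relevant with probability 0.3.
    judge = [1 if t == 1 or rng.random() < 0.3 else 0 for t in truth]
    return precision_at_k(truth, K), precision_at_k(judge, K)


labeled = [simulate_query() for _ in range(100)]            # human + judge labels
unlabeled = [simulate_query()[1] for _ in range(10_000)]    # judge labels only

human = [h for h, _ in labeled]
judge_on_labeled = [j for _, j in labeled]

print("judge-only Precision@K:", round(mean(unlabeled), 3))
print("PPI-corrected Precision@K:", round(ppi_mean(human, judge_on_labeled, unlabeled), 3))
```

The correction term is estimated only from the labeled queries, so its noise shrinks as more human annotations are added, while the large judged pool keeps the variance of the first term small.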

Taxonomy

Topics: Information Retrieval and Search Behavior · Topic Modeling · Natural Language Processing Techniques