Accelerating Unbiased LLM Evaluation via Synthetic Feedback

Zhaoyi Zhou; Yuda Song; Andrea Zanette

arXiv:2502.10563 · cs.LG · February 26, 2025

TL;DR

This paper introduces a statistically principled framework that combines human and synthetic feedback to evaluate large language models more efficiently, reducing reliance on costly human annotations while maintaining unbiased performance metrics.

Contribution

The paper proposes a method that integrates human and synthetic feedback for unbiased LLM evaluation, reducing the number of human annotations required.

Findings

01 Up to 12.2% reduction in human annotations with synthetic evaluators.

02 Up to 24.8% reduction in human annotations with a finetuned synthetic evaluator.

03 The method is scalable, generalizable, and free of hyper-parameter tuning.

Abstract

When developing new large language models (LLMs), a key step is evaluating their final performance, often by computing the win-rate against a reference model based on external feedback. Human feedback is the gold standard, particularly for capturing nuanced qualities like coherence, readability, and alignment with human expectations. However, human evaluations are costly -- even for large tech companies -- and when conducted with active users, they may negatively impact user experience. A promising alternative is synthetic feedback, where evaluations are conducted by other large language models, including reward models. While this eliminates the need for costly human annotations, it introduces biases that may distort the evaluation process. In this work, we propose a statistically principled framework that integrates human and synthetic feedback to reduce reliance on human annotations…
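The abstract is truncated before it names the estimator, so the following is only a minimal sketch of one standard way to combine the two feedback sources: a control-variates estimate of the win-rate, which the repository name listed on this page suggests but the text above does not confirm. The function name `control_variate_win_rate`, its arguments, and the choice to estimate the coefficient from the paired samples are all assumptions made for illustration, not the authors' code.

```python
import numpy as np


def control_variate_win_rate(human, synth_paired, synth_pool):
    """Control-variates estimate of a win-rate (hypothetical helper).

    human        : human preferences on n prompts (e.g. 1 = win, 0 = loss).
    synth_paired : synthetic-evaluator scores on the same n prompts.
    synth_pool   : synthetic scores on a larger pool of prompts, used to
                   estimate the mean of the synthetic score cheaply.
    """
    human = np.asarray(human, dtype=float)
    synth_paired = np.asarray(synth_paired, dtype=float)
    synth_pool = np.asarray(synth_pool, dtype=float)
    if human.shape != synth_paired.shape:
        raise ValueError("human and synth_paired must have the same shape")

    # Coefficient alpha = Cov(human, synth) / Var(synth), estimated from the
    # paired samples. Estimating alpha on the same data is an assumption of
    # this sketch; it adds a small finite-sample bias that vanishes as n grows.
    var_s = synth_paired.var(ddof=1)
    if var_s == 0.0:
        alpha = 0.0  # a constant synthetic score carries no information
    else:
        alpha = np.cov(human, synth_paired, ddof=1)[0, 1] / var_s

    # Shift the human-only mean by how far the paired synthetic mean
    # deviates from the synthetic mean over the larger pool.
    correction = alpha * (synth_paired.mean() - synth_pool.mean())
    return float(human.mean() - correction)
```

With a fixed coefficient the correction term has zero expectation, so the estimate stays centered on the human win-rate whatever the bias of the synthetic evaluator, and the variance falls as the correlation between human and synthetic feedback rises. That is the mechanism by which fewer human annotations can reach the same precision.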

Code & Models

Repositories

Zanette-Labs/control_variates_evaluation
PyTorch · Official

Videos

Accelerating Unbiased LLM Evaluation via Synthetic Feedback · SlidesLive

Taxonomy

Topics: Non-Destructive Testing Techniques