AutoEval Done Right: Using Synthetic Data for Model Evaluation

Pierre Boyeau; Anastasios N. Angelopoulos; Nir Yosef; Jitendra Malik; Michael I. Jordan

arXiv:2403.07008 · cs.LG · May 29, 2024 · 1 citation

TL;DR

This paper introduces statistically principled algorithms for autoevaluation, which uses AI-labeled synthetic data to reduce the number of human annotations needed for model evaluation while keeping the resulting estimates unbiased.

Contribution

It presents algorithms for autoevaluation with synthetic data that improve sample efficiency while remaining unbiased, demonstrated in experiments with GPT-4.

Findings

01 The effective human-labeled sample size increases by up to 50% in experiments with GPT-4.

02 The algorithms remain unbiased while improving sample efficiency.

03 Fewer human annotations are needed to evaluate a model.

Abstract

The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can be used to decrease the number of human annotations required for this purpose in a process called autoevaluation. We suggest efficient and statistically principled algorithms for this purpose that improve sample efficiency while remaining unbiased. These algorithms increase the effective human-labeled sample size by up to 50% on experiments with GPT-4.
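
The abstract does not spell out the algorithms, so the following is only a minimal sketch of one standard way to combine a small human-labeled set with a large AI-labeled set without introducing bias: a difference estimator that averages the synthetic labels over the large pool and then corrects them with the mean human-minus-synthetic gap measured on the labeled set. It is not necessarily the authors' exact method. All function names, the simulated accuracy (0.7), and the agreement rate between synthetic and human labels (0.9) are assumptions made for illustration.

```python
import random
import statistics


def classical_estimate(human_labels):
    """Mean of the human labels and the estimated variance of that mean."""
    n = len(human_labels)
    return statistics.fmean(human_labels), statistics.variance(human_labels) / n


def autoeval_estimate(human_labels, synth_labeled, synth_unlabeled):
    """Difference estimator (illustrative, not the paper's exact algorithm).

    estimate = mean(synthetic labels on the large pool)
             + mean(human - synthetic on the labeled set)

    The correction term cancels the bias of the synthetic labels in
    expectation, provided both sets come from the same distribution.
    """
    n, big_n = len(human_labels), len(synth_unlabeled)
    rectifier = [y - f for y, f in zip(human_labels, synth_labeled)]
    estimate = statistics.fmean(synth_unlabeled) + statistics.fmean(rectifier)
    # The two terms use independent samples, so their variances add.
    variance = (statistics.variance(synth_unlabeled) / big_n
                + statistics.variance(rectifier) / n)
    return estimate, variance


def simulate(seed=0, n=200, big_n=20_000, accuracy=0.7, agreement=0.9):
    """Simulated data: y is a human correctness label, f a synthetic one.

    All parameters are hypothetical; f equals y with probability `agreement`.
    """
    rng = random.Random(seed)

    def draw():
        y = 1 if rng.random() < accuracy else 0
        f = y if rng.random() < agreement else 1 - y
        return y, f

    labeled = [draw() for _ in range(n)]
    human = [y for y, _ in labeled]
    synth_labeled = [f for _, f in labeled]
    # Human labels are never observed for the large pool.
    synth_unlabeled = [draw()[1] for _ in range(big_n)]
    return human, synth_labeled, synth_unlabeled


if __name__ == "__main__":
    human, synth_labeled, synth_unlabeled = simulate()
    c_est, c_var = classical_estimate(human)
    a_est, a_var = autoeval_estimate(human, synth_labeled, synth_unlabeled)
    # Effective sample size: how many human labels the classical estimator
    # would need to reach the same variance.
    effective_n = len(human) * c_var / a_var
    print(f"classical: {c_est:.3f} (variance {c_var:.6f})")
    print(f"autoeval:  {a_est:.3f} (variance {a_var:.6f})")
    print(f"effective human-labeled sample size: {effective_n:.0f}")
```

In this sketch the gain depends on how closely the synthetic labels track the human ones: the better the agreement, the smaller the variance of the correction term and the larger the effective sample size. The simulated numbers say nothing about the gains reported in the paper.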

Code & Models

Repositories

pierreboyeau/autoeval (JAX, official)

Videos

AutoEval Done Right: Using Synthetic Data for Model Evaluation · SlidesLive

Taxonomy

Topics: Simulation Techniques and Applications

Methods: Attention Is All You Need · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Layer Normalization · Absolute Position Encodings · Residual Connection · Dropout · Softmax · Linear Layer · Multi-Head Attention