AutoEval Done Right: Using Synthetic Data for Model Evaluation
Pierre Boyeau, Anastasios N. Angelopoulos, Nir Yosef, Jitendra Malik, Michael I. Jordan

TL;DR
This paper introduces statistically principled algorithms for autoevaluation using synthetic data, significantly reducing the need for human annotations and increasing sample efficiency in model evaluation.
Contribution
The paper presents autoevaluation algorithms that use synthetic data to improve sample efficiency while remaining unbiased, demonstrated in experiments with GPT-4.
Findings
Effective human-labeled sample size increased by up to 50% in experiments with GPT-4.
The algorithms remain unbiased while improving sample efficiency.
Fewer human annotations are needed to evaluate a model.
Abstract
The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can be used to decrease the number of human annotations required for this purpose in a process called autoevaluation. We suggest efficient and statistically principled algorithms for this purpose that improve sample efficiency while remaining unbiased. These algorithms increase the effective human-labeled sample size by up to 50% on experiments with GPT-4.
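The abstract does not spell out the estimator, so the following is only a rough illustration of how AI-labeled data can stretch a small human-labeled set without introducing bias. It uses a control-variate style mean estimate, in the spirit of prediction-powered inference. The function name `debiased_mean`, the weight `lam`, and the toy labels are all hypothetical and are not taken from the paper.

```python
from statistics import fmean


def debiased_mean(human, synth_labeled, synth_unlabeled, lam=1.0):
    """Estimate the mean human label with help from synthetic labels.

    human           : human labels on the small annotated set
    synth_labeled   : synthetic (AI) labels on that same annotated set
    synth_unlabeled : synthetic labels on a larger set with no human labels
    lam             : hypothetical weight on the synthetic data (0 ignores it)

    estimate = lam * mean(synth_unlabeled) + mean(human - lam * synth_labeled)

    If both sets come from the same distribution, the two synthetic terms
    cancel in expectation, so the estimate stays unbiased for the mean
    human label even when the synthetic labels are systematically off.
    """
    if len(human) != len(synth_labeled):
        raise ValueError("human and synth_labeled must be paired")
    if not human or not synth_unlabeled:
        raise ValueError("both sets must be non-empty")
    # Correction term: how far the synthetic labels are from the human ones.
    correction = fmean([h - lam * s for h, s in zip(human, synth_labeled)])
    return lam * fmean(synth_unlabeled) + correction


if __name__ == "__main__":
    # Toy data (made up): 1 = model output judged correct, 0 = incorrect.
    human = [1, 0, 1, 0]
    synth_labeled = [1, 1, 1, 0]
    synth_unlabeled = [1, 1, 1, 1, 0, 1, 1, 1]
    print(debiased_mean(human, synth_labeled, synth_unlabeled))
```

Setting `lam=0` recovers the plain human-only average. A nonzero `lam` lets the larger synthetic set reduce variance when the synthetic labels correlate with the human ones, which is the intuition behind an increase in effective sample size.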
Taxonomy
Topics: Simulation Techniques and Applications
Methods: Attention Is All You Need · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Layer Normalization · Absolute Position Encodings · Residual Connection · Dropout · Softmax · Linear Layer · Multi-Head Attention
