Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation
Tu Vu, Kalpesh Krishna, Salaheddin Alzubi, Chris Tar, Manaal Faruqui, Yun-Hsuan Sung

TL;DR
The paper introduces FLAMe, a family of large autorater models trained on diverse human judgments, significantly improving automatic evaluation of LLM outputs across multiple benchmarks and reducing bias compared to proprietary models.
Contribution
We develop FLAMe, a family of large autorater models trained on more than 5 million human judgments, which improves generalization and performance on automatic LLM evaluation tasks.
Findings
FLAMe outperforms LLMs trained on proprietary data, such as GPT-4 and Claude-3, on many held-out tasks.
FLAMe-RM-24B, the 24B-parameter variant fine-tuned for reward modeling, achieves 87.8% accuracy on RewardBench.
Tail-patch fine-tuning reduces training data needs by 25x with competitive results.
Abstract
As large language models (LLMs) advance, it becomes more challenging to reliably evaluate their output due to the high costs of human evaluation. To make progress towards better LLM autoraters, we introduce FLAMe, a family of Foundational Large Autorater Models. FLAMe is trained on our large and diverse collection of 100+ quality assessment tasks comprising 5M+ human judgments, curated and standardized using publicly released human evaluations from previous research. FLAMe significantly improves generalization to a wide variety of held-out tasks, outperforming LLMs trained on proprietary data like GPT-4 and Claude-3 on many tasks. We show that FLAMe can also serve as a powerful starting point for further downstream fine-tuning, using reward modeling evaluation as a case study (FLAMe-RM). Notably, on RewardBench, our FLAMe-RM-24B model (with an accuracy of 87.8%) is the top-performing…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Residual Connection · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Adam · Dropout · Multi-Head Attention · Dense Connections
