Foundational Autoraters: Taming Large Language Models for Better Automatic Evaluation

Tu Vu; Kalpesh Krishna; Salaheddin Alzubi; Chris Tar; Manaal Faruqui; Yun-Hsuan Sung

arXiv:2407.10817·cs.CL·July 16, 2024

TL;DR

The paper introduces FLAMe, a family of large autorater models trained on diverse human judgments, significantly improving automatic evaluation of LLM outputs across multiple benchmarks and reducing bias compared to proprietary models.

Contribution

We develop FLAMe, a family of large autorater models trained on more than 100 quality assessment tasks comprising over 5 million human judgments, which improves generalization and performance in automatic LLM evaluation.

Findings

01

FLAMe outperforms LLMs trained on proprietary data, such as GPT-4 and Claude-3, on many held-out tasks.

02

FLAMe-RM-24B, a 24B-parameter FLAMe model further fine-tuned for reward modeling evaluation, achieves 87.8% accuracy on RewardBench.

03

Tail-patch fine-tuning reduces training data needs by 25x with competitive results.

Abstract

As large language models (LLMs) advance, it becomes more challenging to reliably evaluate their output due to the high costs of human evaluation. To make progress towards better LLM autoraters, we introduce FLAMe, a family of Foundational Large Autorater Models. FLAMe is trained on our large and diverse collection of 100+ quality assessment tasks comprising 5M+ human judgments, curated and standardized using publicly released human evaluations from previous research. FLAMe significantly improves generalization to a wide variety of held-out tasks, outperforming LLMs trained on proprietary data like GPT-4 and Claude-3 on many tasks. We show that FLAMe can also serve as a powerful starting point for further downstream fine-tuning, using reward modeling evaluation as a case study (FLAMe-RM). Notably, on RewardBench, our FLAMe-RM-24B model (with an accuracy of 87.8%) is the top-performing…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Attention Is All You Need · Residual Connection · Byte Pair Encoding · Layer Normalization · Label Smoothing · Linear Layer · Adam · Dropout · Multi-Head Attention · Dense Connections