From Jack of All Trades to Master of One: Specializing LLM-based Autoraters to a Test Set

Mara Finkelstein, Dan Deutsch, Parker Riley, Juraj Juraska, Geza Kovacs, and Markus Freitag

arXiv:2411.15387 · cs.CL · December 13, 2024

TL;DR

This paper introduces a method to specialize LLM-based automatic evaluators for specific test sets using historical ratings, significantly improving their accuracy in machine translation evaluation.

Contribution

It proposes a novel specialization technique for LLM evaluators that leverages past ratings to enhance performance on fixed test sets, outperforming existing metrics.

Findings

1. Specialized evaluators outperform state-of-the-art metrics by over 50%.

2. The method is robust across different LLMs, test sets, and evaluation tasks.

3. Analysis reveals how rater variability influences evaluation accuracy.

Abstract

As LLMs continue to become more powerful and versatile, human evaluation has quickly become intractable at scale and reliance on automatic metrics has become the norm. Recently, it has been shown that LLMs are themselves state-of-the-art evaluators for many tasks. These Autoraters are typically designed so that they generalize to new systems and test sets. In practice, however, evaluation is performed on a small set of fixed, canonical test sets, which are carefully curated to measure certain capabilities of interest and are not changed frequently. In this work, we design a method which specializes a prompted Autorater to a given test set, by leveraging historical ratings on the test set to construct in-context learning (ICL) examples. We evaluate our Specialist method on the task of fine-grained machine translation evaluation, and show that it dramatically outperforms the…
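The core idea in the abstract, building in-context learning (ICL) examples from historical ratings on the test set, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: the `Rating` record, the helper names `index_by_source` and `build_prompt`, the prompt wording, and in particular the choice to retrieve past ratings of the same source segment as the examples.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Rating:
    """Hypothetical record of one historical rating on the test set."""
    source: str        # test-set source segment
    translation: str   # a system output that was rated in the past
    annotation: str    # the historical rating, kept as free-form text here


def index_by_source(history):
    """Group historical ratings by the test-set segment they were given on."""
    index = defaultdict(list)
    for rating in history:
        index[rating.source].append(rating)
    return index


def build_prompt(source, candidate, index, max_examples=3):
    """Assemble a prompt whose ICL examples are past ratings of the same source.

    Assumption: examples are selected by exact source match; the candidate
    itself is excluded so its own rating never leaks into the prompt.
    """
    examples = [r for r in index.get(source, []) if r.translation != candidate]
    parts = ["Rate the translation of the source segment."]
    for r in examples[:max_examples]:
        parts.append(
            f"Source: {r.source}\nTranslation: {r.translation}\nRating: {r.annotation}"
        )
    # The query comes last, with the rating left open for the LLM to complete.
    parts.append(f"Source: {source}\nTranslation: {candidate}\nRating:")
    return "\n\n".join(parts)


if __name__ == "__main__":
    history = [
        Rating("Guten Morgen", "Good morning", "no errors"),
        Rating("Guten Morgen", "Good tomorrow", "major accuracy error: 'tomorrow'"),
        Rating("Danke", "Thanks", "no errors"),
    ]
    print(build_prompt("Guten Morgen", "Good morgen", index_by_source(history)))
```

A source segment with no historical ratings simply yields a prompt with no examples, so in this sketch the specialized autorater falls back to zero-shot prompting for unseen inputs.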

Taxonomy

Topics: Artificial Intelligence in Law · Law, AI, and Intellectual Property

Methods: Sparse Evolutionary Training