From Jack of All Trades to Master of One: Specializing LLM-based Autoraters to a Test Set
Mara Finkelstein, Dan Deutsch, Parker Riley, Juraj Juraska, Geza Kovacs, and Markus Freitag

TL;DR
This paper introduces a method to specialize LLM-based automatic evaluators for specific test sets using historical ratings, significantly improving their accuracy in machine translation evaluation.
Contribution
It proposes a novel specialization technique for LLM evaluators that leverages past ratings to enhance performance on fixed test sets, outperforming existing metrics.
Findings
Specialized evaluators outperform state-of-the-art metrics by over 50%.
The method is robust across different LLMs, test sets, and evaluation tasks.
Analysis reveals how rater variability influences evaluation accuracy.
Abstract
As LLMs continue to become more powerful and versatile, human evaluation has quickly become intractable at scale and reliance on automatic metrics has become the norm. Recently, it has been shown that LLMs are themselves state-of-the-art evaluators for many tasks. These Autoraters are typically designed so that they generalize to new systems and test sets. In practice, however, evaluation is performed on a small set of fixed, canonical test sets, which are carefully curated to measure certain capabilities of interest and are not changed frequently. In this work, we design a method which specializes a prompted Autorater to a given test set, by leveraging historical ratings on the test set to construct in-context learning (ICL) examples. We evaluate our Specialist method on the task of fine-grained machine translation evaluation, and show that it dramatically outperforms the…
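The core idea in the abstract, building ICL examples from historical ratings on the same test set, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the `HistoricalRating` schema, the helper names `index_by_source` and `build_prompt`, the prompt wording, the choice to pull examples from past ratings of the same source segment, and the choice to skip a past rating whose translation is identical to the one being scored.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class HistoricalRating:
    """One past rating of one translation of a test-set segment (hypothetical schema)."""
    source_id: str    # identifier of the test-set source segment
    source: str       # source text
    translation: str  # a system output that was rated in the past
    annotation: str   # the rating, e.g. error spans described as free text


def index_by_source(ratings):
    """Group historical ratings by the test-set segment they belong to."""
    index = defaultdict(list)
    for rating in ratings:
        index[rating.source_id].append(rating)
    return index


def build_prompt(index, source_id, source, translation, max_examples=3):
    """Build a prompt whose ICL examples are past ratings of the same source segment."""
    # Assumption: skip past ratings of the exact translation being scored.
    examples = [
        r for r in index.get(source_id, []) if r.translation != translation
    ][:max_examples]

    parts = ["Identify the errors in the translation."]
    for r in examples:
        parts.append(
            f"Source: {r.source}\nTranslation: {r.translation}\nErrors: {r.annotation}"
        )
    # The query comes last, with the annotation left for the model to complete.
    parts.append(f"Source: {source}\nTranslation: {translation}\nErrors:")
    return "\n\n".join(parts)
```

The returned string would be sent to whichever LLM serves as the Autorater; that call is left out because the summary does not specify the model or API.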
Taxonomy
Topics: Artificial Intelligence in Law · Law, AI, and Intellectual Property
Methods: Sparse Evolutionary Training
