Graders should cheat: privileged information enables expert-level automated evaluations

Jin Peng Zhou; Sébastien M. R. Arnold; Nan Ding; Kilian Q. Weinberger; Nan Hua; Fei Sha

arXiv:2502.10961·cs.LG·February 18, 2025

TL;DR

Providing privileged information like ground-truth solutions enables language model graders to reliably evaluate complex problems beyond their usual capabilities, expanding their applicability and improving evaluation accuracy.

Contribution

The paper demonstrates that privileged information significantly enhances LM graders' ability to evaluate challenging problems, surpassing previous methods and matching expert-level assessments.

Findings

01

Privileged information improves evaluation accuracy on complex problems.

02

LM graders outperform human raters on certain benchmarks.

03

Approach extends the applicability of automated evaluation methods.

Abstract

Auto-evaluating language models (LMs), i.e., using a grader LM to evaluate the candidate LM, is an appealing way to accelerate the evaluation process and reduce the cost associated with it. But this presents a paradox: how can we trust the grader LM, which is presumably weaker than the candidate LM, to assess problems that are beyond the frontier of the capabilities of either model or both? For instance, today's LMs struggle on graduate-level physics and Olympiad-level math, making them unreliable graders in these domains. We show that providing privileged information -- such as ground-truth solutions or problem-specific guidelines -- improves automated evaluations on such frontier problems. This approach offers two key advantages. First, it expands the range of problems where LM graders apply. Specifically, weaker models can now rate the predictions of stronger models. Second, privileged…
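
The core idea in the abstract, giving the grader information the candidate never sees, can be illustrated with a minimal sketch of prompt assembly. This is not the paper's prompt or code: the function name `build_grader_prompt`, the section labels, and the CORRECT/INCORRECT verdict format are all assumptions made for illustration, and the call to an actual grader LM is deliberately left out.

```python
from typing import Optional


def build_grader_prompt(
    problem: str,
    candidate: str,
    reference_solution: Optional[str] = None,  # privileged: ground-truth solution
    guidelines: Optional[str] = None,          # privileged: problem-specific guidelines
) -> str:
    """Assemble a grading prompt (hypothetical format, not the paper's).

    Privileged fields are added only when provided, so the same function
    covers both the plain grader and the privileged-information grader.
    """
    parts = [
        "You are grading a candidate answer.",
        f"Problem:\n{problem}",
    ]
    if reference_solution is not None:
        parts.append(f"Reference solution (not shown to the candidate):\n{reference_solution}")
    if guidelines is not None:
        parts.append(f"Grading guidelines:\n{guidelines}")
    parts.append(f"Candidate answer:\n{candidate}")
    parts.append("Reply with CORRECT or INCORRECT.")
    return "\n\n".join(parts)


if __name__ == "__main__":
    # Toy problem purely for demonstration.
    print(build_grader_prompt("What is 2 + 2?", "4", reference_solution="4"))
```

With the reference solution in the prompt, the grader's task shifts from solving the problem to comparing the candidate answer against a known solution, which is why a weaker model can plausibly rate a stronger one.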

Taxonomy

Topics: Student Assessment and Feedback