Graders should cheat: privileged information enables expert-level automated evaluations
Jin Peng Zhou, Sébastien M. R. Arnold, Nan Ding, Kilian Q. Weinberger, Nan Hua, Fei Sha

TL;DR
Providing privileged information like ground-truth solutions enables language model graders to reliably evaluate complex problems beyond their usual capabilities, expanding their applicability and improving evaluation accuracy.
Contribution
The paper demonstrates that privileged information significantly enhances LM graders' ability to evaluate challenging problems, surpassing previous methods and matching expert-level assessments.
Findings
Privileged information improves evaluation accuracy on complex problems.
LM graders outperform human raters on certain benchmarks.
Approach extends the applicability of automated evaluation methods.
Abstract
Auto-evaluating language models (LMs), i.e., using a grader LM to evaluate the candidate LM, is an appealing way to accelerate the evaluation process and reduce the cost associated with it. But this presents a paradox: how can we trust the grader LM, which is presumably weaker than the candidate LM, to assess problems that are beyond the frontier of the capabilities of either model or both? For instance, today's LMs struggle on graduate-level physics and Olympiad-level math, making them unreliable graders in these domains. We show that providing privileged information -- such as ground-truth solutions or problem-specific guidelines -- improves automated evaluations on such frontier problems. This approach offers two key advantages. First, it expands the range of problems where LM graders apply. Specifically, weaker models can now rate the predictions of stronger models. Second, privileged…
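As a rough illustration of the idea in the abstract, the sketch below shows one way a grader prompt might be assembled with and without privileged information. It is not the authors' implementation: the `GradingItem` type, the `build_grader_prompt` and `parse_verdict` helpers, the prompt wording, and the CORRECT/INCORRECT verdict format are all assumptions made for this example, and no actual LM call is included.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GradingItem:
    """One evaluation instance (hypothetical structure, not from the paper)."""
    problem: str
    candidate_answer: str
    reference_solution: Optional[str] = None  # privileged: ground-truth solution
    guidelines: Optional[str] = None          # privileged: problem-specific guidelines


def build_grader_prompt(item: GradingItem) -> str:
    """Assemble a grader prompt, appending privileged fields only when present."""
    parts = [
        "You are grading a candidate answer.",
        f"Problem:\n{item.problem}",
        f"Candidate answer:\n{item.candidate_answer}",
    ]
    if item.reference_solution is not None:
        parts.append(f"Ground-truth solution:\n{item.reference_solution}")
    if item.guidelines is not None:
        parts.append(f"Problem-specific grading guidelines:\n{item.guidelines}")
    # Assumed verdict format; the paper may use a different rating scheme.
    parts.append("Reply with exactly one word: CORRECT or INCORRECT.")
    return "\n\n".join(parts)


def parse_verdict(grader_output: str) -> bool:
    """Map the grader's reply to a boolean; anything but CORRECT counts as incorrect."""
    return grader_output.strip().upper() == "CORRECT"
```

With this layout, the same grader can be run in a "blind" condition (no privileged fields set) and a "privileged" condition (reference solution or guidelines set), so the two sets of verdicts can be compared against expert labels.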
Taxonomy
Topics: Student Assessment and Feedback
