Evaluating the Evaluator: Measuring LLMs' Adherence to Task Evaluation Instructions
Bhuvanashree Murugadoss, Christian Poelitz, Ian Drosos, Vu Le, Nick McKenna, Carina Suzana Negreanu, Chris Parnin, Advait Sarkar

TL;DR
This paper investigates how closely large language models (LLMs) used as evaluators adhere to the evaluation instructions in their prompts, and compares prompt-based evaluation with perplexity-based evaluation. It finds that detailed prompt instructions have limited impact, and that perplexity is sometimes better aligned with human judgments.
Contribution
The study provides a systematic analysis of LLMs-as-evaluators, introduces a benchmark taxonomy of quality criteria, and compares prompt-based and perplexity-based evaluation methods.
Findings
Detailed instructions in prompts yield only limited benefit.
Perplexity sometimes aligns better with human judgments than prompt-based evaluation.
A taxonomy of quality criteria is proposed for evaluating LLMs as judges.
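As a rough illustration of what a perplexity-based score is, the sketch below computes perplexity from per-token log-probabilities using the standard definition (the exponential of the average negative log-likelihood). It is not the paper's implementation: the function name `perplexity` and the example probabilities are hypothetical, and in practice the log-probabilities would come from a language model scoring the text under evaluation.

```python
import math
from typing import Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity from per-token natural-log probabilities.

    Standard definition: exp of the mean negative log-likelihood.
    Lower values mean the model found the text more predictable.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    avg_nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_nll)


# Hypothetical values (assumption): each token of the first text is
# assigned probability 0.5, each token of the second probability 0.1.
fluent = [math.log(0.5)] * 4
awkward = [math.log(0.1)] * 4

print(perplexity(fluent), perplexity(awkward))
```

Unlike a prompted judgment, this score involves no instructions at all, which is why it is a useful point of comparison when asking how much the prompt wording contributes.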
Abstract
LLMs-as-a-judge is a recently popularized method which replaces human judgements in task evaluation (Zheng et al. 2024) with automatic evaluation using LLMs. Due to widespread use of RLHF (Reinforcement Learning from Human Feedback), state-of-the-art LLMs like GPT4 and Llama3 are expected to have strong alignment with human preferences when prompted for a quality judgement, such as the coherence of a text. While this seems beneficial, it is not clear whether the assessments by an LLM-as-a-judge constitute only an evaluation based on the instructions in the prompts, or reflect its preference for high-quality data similar to its fine-tune data. To investigate how much influence prompting the LLMs-as-a-judge has on the alignment of AI judgements to human judgements, we analyze prompts with increasing levels of instructions about the target quality of an evaluation, for several…
Taxonomy
Topics: Human Resource Development and Performance Evaluation
Methods: ALIGN
