Limitations of the LLM-as-a-Judge Approach for Evaluating LLM Outputs in Expert Knowledge Tasks
Annalisa Szymanski, Noah Ziems, Heather A. Eicher-Miller, Toby Jia-Jun Li, Meng Jiang, Ronald A. Metoyer

TL;DR
This study investigates the limitations of using Large Language Models as sole evaluators for expert knowledge tasks, revealing moderate agreement with human experts and emphasizing the need for human oversight.
Contribution
The paper provides empirical evidence on the limited reliability of LLMs as judges in domain-specific evaluations, highlighting the importance of human involvement.
Findings
SMEs agreed with LLM judgments 68% of the time in dietetics and 64% of the time in mental health.
Agreement varies across different evaluation aspects.
LLMs alone may lack the depth needed for complex, knowledge-specific assessments.
Abstract
The potential of using Large Language Models (LLMs) themselves to evaluate LLM outputs offers a promising method for assessing model performance across various contexts. Previous research indicates that LLM-as-a-judge exhibits a strong correlation with human judges in the context of general instruction following. However, for instructions that require specialized knowledge, the validity of using LLMs as judges remains uncertain. In our study, we applied a mixed-methods approach, conducting pairwise comparisons in which both subject matter experts (SMEs) and LLMs evaluated outputs from domain-specific tasks. We focused on two distinct fields: dietetics, with registered dietitian experts, and mental health, with clinical psychologist experts. Our results showed that SMEs agreed with LLM judges 68% of the time in the dietetics domain and 64% in mental health when evaluating overall…
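The agreement figures reported here can be read as simple percent agreement over pairwise comparisons: the share of comparisons in which the SME and the LLM judge preferred the same output. The Python sketch below illustrates that calculation under stated assumptions. The function name `agreement_rate`, the "A"/"B" labels, and the toy data are hypothetical and are not taken from the paper, whose exact scoring procedure may differ.

```python
def agreement_rate(sme_choices, llm_choices):
    """Return the fraction of pairwise comparisons where both judges chose the same output."""
    if len(sme_choices) != len(llm_choices):
        raise ValueError("both judges must rate the same set of comparisons")
    if not sme_choices:
        raise ValueError("no comparisons to score")
    # Count comparisons where the expert and the LLM picked the same side.
    matches = sum(s == l for s, l in zip(sme_choices, llm_choices))
    return matches / len(sme_choices)


# Hypothetical toy data (NOT from the paper): each entry is the preferred
# output, "A" or "B", for one pairwise comparison.
sme = ["A", "B", "A", "A", "B"]
llm = ["A", "B", "B", "A", "A"]

print(f"{agreement_rate(sme, llm):.0%}")  # prints 60%
```

Percent agreement is easy to interpret but does not correct for agreement expected by chance, so it is best read alongside the per-aspect breakdown rather than as a standalone reliability score.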
Taxonomy
Topics: Quality and Management Systems · Medical Malpractice and Liability Issues · Risk and Safety Analysis
