TL;DR
This paper investigates how inconsistently large language models score the same outputs when used as evaluators for natural language generation, showing low intra-rater reliability across repeated runs and what that implies for trusting their ratings.
Contribution
It reveals the self-inconsistency problem in LLM-based evaluation frameworks and quantifies this issue across various NLG tasks and benchmarks.
Findings
LLM judges show low intra-rater reliability across different runs.
Inconsistency can lead to arbitrary ratings, affecting evaluation trustworthiness.
Proper guidelines may mitigate some issues, but the core problem persists.
Abstract
As Natural Language Generation (NLG) continues to be widely adopted, properly assessing it has become quite difficult. Lately, using large language models (LLMs) for evaluating these generations has gained traction, as they tend to align more closely with human preferences than conventional n-gram or embedding-based metrics. In our experiments, we show that LLM judges have low intra-rater reliability in their assigned scores across different runs. This variance makes their ratings inconsistent, almost arbitrary in the worst case, making it difficult to measure how good their judgments actually are. We quantify this inconsistency across different NLG tasks and benchmarks and examine whether judicious use of LLM judges, following proper guidelines, can still be useful.
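To make "intra-rater reliability across different runs" concrete, here is a minimal sketch of how one might measure it when the same judge scores the same items several times. It is an illustration under stated assumptions, not the paper's own procedure: the function names, the choice of Krippendorff's alpha with an interval distance, and the example scores are all hypothetical, and the data is assumed complete (every item scored in every run).

```python
from itertools import combinations
from statistics import pstdev


def exact_agreement_rate(scores):
    """Fraction of items that received the same score in every run."""
    return sum(len(set(runs)) == 1 for runs in scores) / len(scores)


def mean_score_spread(scores):
    """Average per-item population standard deviation across runs."""
    return sum(pstdev(runs) for runs in scores) / len(scores)


def _mean_squared_difference(values):
    """Mean squared difference over all unordered pairs of values."""
    pairs = list(combinations(values, 2))
    return sum((a - b) ** 2 for a, b in pairs) / len(pairs)


def krippendorff_alpha_interval(scores):
    """Krippendorff's alpha (interval distance) for complete data.

    scores[i][r] is the score the judge gave item i in run r.
    Assumes every item has the same number of runs (at least 2).
    """
    # Observed disagreement: variation between runs on the same item.
    observed = sum(_mean_squared_difference(runs) for runs in scores) / len(scores)
    # Expected disagreement: variation between any two scores overall.
    expected = _mean_squared_difference([s for runs in scores for s in runs])
    if expected == 0:
        raise ValueError("alpha is undefined when all scores are identical")
    return 1.0 - observed / expected


# Hypothetical data: 4 items, each scored in 3 separate runs of one judge.
example_scores = [[4, 4, 4], [2, 3, 5], [1, 1, 2], [5, 5, 5]]

if __name__ == "__main__":
    print("exact agreement:", exact_agreement_rate(example_scores))
    print("mean spread:", mean_score_spread(example_scores))
    print("alpha:", krippendorff_alpha_interval(example_scores))
```

An alpha of 1 means the judge repeats itself perfectly, while values near 0 mean the run-to-run variation on a single item is about as large as the variation between unrelated items, which is the "almost arbitrary" case the abstract describes.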