TL;DR
This paper presents LLM-Rubric, a framework that uses large language models and a calibrated neural network to automate and improve the evaluation of natural language texts across multiple dimensions, aligning closely with human judgments.
Contribution
Introduces LLM-Rubric, a novel method combining LLMs and neural calibration to predict human evaluations of text quality across multiple dimensions.
Findings
LLM-Rubric predicts human satisfaction with RMS error < 0.5
Achieves a 2x improvement in RMS error over the uncalibrated baseline
Effective in evaluating dialogue system quality
Abstract
This paper introduces a framework for the automated evaluation of natural language texts. A manually constructed rubric describes how to assess multiple dimensions of interest. To evaluate a text, a large language model (LLM) is prompted with each rubric question and produces a distribution over potential responses. The LLM predictions often fail to agree well with human judges -- indeed, the humans do not fully agree with one another. However, the multiple LLM distributions can be combined to predict each human judge's annotations on all questions, including a summary question that assesses overall quality or relevance. LLM-Rubric accomplishes this by training a small feed-forward neural network that includes both judge-specific and judge-independent parameters. When evaluating dialogue systems in a human-AI information-seeking task, we find that LLM-Rubric with 9…
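The calibration step described in the abstract can be illustrated with a minimal sketch. It assumes a single hidden layer whose weights are the sum of a judge-independent matrix and a judge-specific matrix; the abstract does not specify the architecture, so the layer sizes, the additive parameterization, and every name here (`init_params`, `predict`, `expected_score`, the constants) are hypothetical. Only the forward pass is shown, with random untrained weights; training against human annotations is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# All sizes below are illustrative assumptions, not values from the paper.
N_QUESTIONS = 3   # rubric questions, the last one playing the "summary" role
N_OPTIONS = 4     # answer options per question, scored 1..4
N_JUDGES = 2      # human judges with their own parameters
HIDDEN = 8        # hidden units in the small feed-forward network


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def init_params():
    """Random (untrained) judge-independent and judge-specific weights."""
    d_in = N_QUESTIONS * N_OPTIONS
    return {
        "W_shared": rng.normal(0, 0.1, (HIDDEN, d_in)),
        "W_judge": rng.normal(0, 0.1, (N_JUDGES, HIDDEN, d_in)),
        "b": np.zeros(HIDDEN),
        "V_shared": rng.normal(0, 0.1, (N_QUESTIONS, N_OPTIONS, HIDDEN)),
        "V_judge": rng.normal(0, 0.1, (N_JUDGES, N_QUESTIONS, N_OPTIONS, HIDDEN)),
    }


def predict(params, llm_dists, judge):
    """Map the LLM's per-question distributions to one judge's predicted answers.

    llm_dists: array of shape (N_QUESTIONS, N_OPTIONS), each row a distribution.
    Returns an array of the same shape, each row a predicted distribution.
    """
    x = llm_dists.reshape(-1)                           # concatenate all questions
    W = params["W_shared"] + params["W_judge"][judge]   # shared + judge-specific
    h = np.tanh(W @ x + params["b"])
    V = params["V_shared"] + params["V_judge"][judge]
    logits = V @ h                                      # shape (N_QUESTIONS, N_OPTIONS)
    return softmax(logits, axis=-1)


def expected_score(dist):
    """Expected value of a distribution over the options 1..N_OPTIONS."""
    return float(dist @ np.arange(1, N_OPTIONS + 1))


params = init_params()
# Stand-in for real LLM output: one random distribution per rubric question.
llm_dists = softmax(rng.normal(size=(N_QUESTIONS, N_OPTIONS)))
pred = predict(params, llm_dists, judge=0)
summary_score = expected_score(pred[-1])
```

Passing a different `judge` index changes only the judge-specific terms, which is one simple way to model the disagreement between human judges that the abstract mentions.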
