LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts

Helia Hashemi, Jason Eisner, Corby Rosset, Benjamin Van Durme, and Chris Kedzie

arXiv:2501.00274 · cs.CL · January 3, 2025

TL;DR

This paper presents LLM-Rubric, a framework that uses large language models and a calibrated neural network to automate and improve the evaluation of natural language texts across multiple dimensions, aligning closely with human judgments.

Contribution

Introduces LLM-Rubric, a novel method combining LLMs and neural calibration to predict human evaluations of text quality across multiple dimensions.

Findings

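01  LLM-Rubric predicts human judges' overall satisfaction ratings (on a 1-4 scale) with RMS error < 0.5.

02  This is a 2x improvement over the uncalibrated baseline.

03  The approach is effective for evaluating dialogue system quality in a human-AI information-seeking task.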
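For readers unfamiliar with the metric in the findings above, here is a minimal sketch of how a root-mean-square (RMS) error between predicted and human ratings can be computed. The function name `rmse` and the rating values are invented for illustration; they are not data from the paper.

```python
import math

def rmse(predicted, actual):
    """Root-mean-square error between two equal-length rating lists."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Toy numbers, made up for this example only (NOT results from the paper).
human_ratings = [4, 3, 1, 2]              # hypothetical judge scores
toy_predictions = [3.5, 3.0, 1.5, 2.5]    # hypothetical predicted scores

error = rmse(toy_predictions, human_ratings)
print(f"RMS error on toy data: {error:.3f}")
```

A lower value means the predicted ratings sit closer to the human ratings on average, with large misses penalized more heavily than small ones.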

Abstract

This paper introduces a framework for the automated evaluation of natural language texts. A manually constructed rubric describes how to assess multiple dimensions of interest. To evaluate a text, a large language model (LLM) is prompted with each rubric question and produces a distribution over potential responses. The LLM predictions often fail to agree well with human judges; indeed, the humans do not fully agree with one another. However, the multiple LLM distributions can be combined to predict each human judge's annotations on all questions, including a summary question that assesses overall quality or relevance. LLM-Rubric accomplishes this by training a small feed-forward neural network that includes both judge-specific and judge-independent parameters. When evaluating dialogue systems in a human-AI information-seeking task, we find that LLM-Rubric with 9…
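The calibration step described in the abstract can be pictured with a small forward-pass sketch. This is an illustration under stated assumptions, not the authors' implementation: the class name `CalibrationNet`, the tanh hidden layer, the layer sizes, and the choice to add judge-specific weights to shared weights are all hypothetical, and training is omitted entirely.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

def init_matrix(rows, cols, rng, scale=0.1):
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]

class CalibrationNet:
    """Hypothetical sketch: shared weights plus per-judge weight offsets."""

    def __init__(self, n_inputs, n_hidden, n_ratings, n_judges, seed=0):
        rng = random.Random(seed)
        # Judge-independent parameters (assumed layout).
        self.W_shared = init_matrix(n_hidden, n_inputs, rng)
        self.V_shared = init_matrix(n_ratings, n_hidden, rng)
        # Judge-specific parameters, one set per judge (assumed layout).
        self.W_judge = [init_matrix(n_hidden, n_inputs, rng) for _ in range(n_judges)]
        self.V_judge = [init_matrix(n_ratings, n_hidden, rng) for _ in range(n_judges)]

    def predict(self, llm_distributions, judge):
        """Map per-question LLM answer distributions to one judge's rating distribution."""
        # Concatenate the LLM's distribution for every rubric question.
        x = [p for dist in llm_distributions for p in dist]
        pre = add(matvec(self.W_shared, x), matvec(self.W_judge[judge], x))
        h = [math.tanh(z) for z in pre]
        logits = add(matvec(self.V_shared, h), matvec(self.V_judge[judge], h))
        return softmax(logits)

def expected_rating(dist):
    """Expected rating when option k (0-based) corresponds to rating k + 1."""
    return sum((k + 1) * p for k, p in enumerate(dist))

# Made-up LLM distributions for 3 rubric questions with 4 answer options each.
llm_dists = [[0.1, 0.2, 0.3, 0.4], [0.25, 0.25, 0.25, 0.25], [0.7, 0.1, 0.1, 0.1]]
net = CalibrationNet(n_inputs=12, n_hidden=5, n_ratings=4, n_judges=2)
probs_judge0 = net.predict(llm_dists, judge=0)
probs_judge1 = net.predict(llm_dists, judge=1)
print(expected_rating(probs_judge0), expected_rating(probs_judge1))
```

Because the weights here are random and untrained, the outputs carry no meaning; the point is only that the same LLM distributions yield a different predicted distribution for each judge, which is what lets such a model account for disagreement among human judges.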

Code & Models

Repositories

microsoft/llm-rubric (PyTorch, official)

Videos

LLM-Rubric: A Multidimensional, Calibrated Approach to Automated Evaluation of Natural Language Texts · Underline