Benchmarking LLMs' Judgments with No Gold Standard

Shengwei Xu; Yuxuan Lu; Grant Schoenebeck; Yuqing Kong

arXiv:2411.07127·cs.CL·April 30, 2025

TL;DR

This paper introduces GEM, a new evaluation metric for LLMs that assesses their judgment quality without relying on gold standard references, and presents GRE-bench, a benchmark for peer review generation.

Contribution

The paper proposes GEM, a mutual information-based metric for evaluating LLMs in subjective tasks without gold standards, and introduces GRE-bench, a new benchmark for peer review quality assessment.

Findings

01. GEM correlates with human scores competitively with the GPT-4o Examiner and outperforms all other baselines.

02. GEM is robust against strategic manipulations such as rephrasing.

03. GRE-bench effectively evaluates LLM peer review generation using new research data.

Abstract

We introduce the GEM (Generative Estimator for Mutual Information), an evaluation metric for assessing language generation by Large Language Models (LLMs), particularly in generating informative judgments, without the need for a gold standard reference. GEM broadens the scenarios where we can benchmark LLM generation performance, from traditional ones, like machine translation and summarization, where gold standard references are readily available, to subjective tasks without clear gold standards, such as academic peer review. GEM uses a generative model to estimate mutual information between candidate and reference responses, without requiring the reference to be a gold standard. In experiments on a human-annotated dataset, GEM demonstrates competitive correlations with human scores compared to the state-of-the-art GPT-4o Examiner, and outperforms all other baselines. Additionally,…
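The core idea in the abstract, scoring a candidate by how much information it carries about a reference response, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the score takes a pointwise mutual information form, log P(reference | candidate) - log P(reference), and it replaces the generative model with a hypothetical toy unigram model (`toy_logprob`, `BACKGROUND`, and `pmi_score` are illustrative names, not from the paper or its repository).

```python
import math
from collections import Counter
from typing import Callable, Optional

# Hypothetical interface: returns log P(text | context) under a generative model.
LogProbFn = Callable[[str, Optional[str]], float]

# Tiny background corpus standing in for the model's prior knowledge (assumption).
BACKGROUND = "the paper is good the method is new the results are clear".split()


def toy_logprob(text: str, context: Optional[str] = None) -> float:
    """Toy stand-in for an LLM: an add-one-smoothed unigram model whose
    counts are the background corpus plus the context tokens, if any."""
    counts = Counter(BACKGROUND)
    if context:
        counts.update(context.lower().split())
    tokens = text.lower().split()
    vocab = set(counts) | set(tokens)
    total = sum(counts.values()) + len(vocab)
    return sum(math.log((counts[t] + 1) / total) for t in tokens)


def pmi_score(candidate: str, reference: str,
              logprob: LogProbFn = toy_logprob) -> float:
    """Pointwise mutual information estimate between candidate and reference:
    log P(reference | candidate) - log P(reference)."""
    return logprob(reference, candidate) - logprob(reference, None)


reference = "the proof of theorem 2 has a gap"
related = "theorem 2 proof has a gap in the induction step"
unrelated = "nice weather today"

# A candidate that shares content with the reference should score higher.
print(pmi_score(related, reference), pmi_score(unrelated, reference))
```

The reference here is only another response to the same item, not a gold standard: the score rewards a candidate for making the reference more predictable. In practice `logprob` would be backed by an actual language model rather than the toy counts used above.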

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01·Rating 6·Confidence 4

Strengths

Using mutual information to evaluate subjective generative responses is novel, to my understanding. The paper gives theoretical guarantees for the proposed method. Experiments show that the method works well on the task of evaluating reviews of academic papers or student proposals.

Weaknesses

Only one type of evaluation task has been assessed: paper review evaluation. It would be more convincing if the paper conducted experiments on more subjective evaluation tasks, such as summarization, dialog, etc. The paper uses Llama 3.1 8B as the evaluation LM. I would like to see how different LMs affect the performance of the evaluation. Figure 4 is not mentioned in the paper.

Reviewer 02·Rating 6·Confidence 3

Strengths

The paper is clearly written and rigorously formulated. The research topic is very timely and important, given the increasing use of LLMs and the challenges of evaluating their performance in subjective tasks.

Weaknesses

I have some questions and concerns: 1. Impact of Suboptimal Human References: I am concerned about GEM's reliability when human references are of lower quality than LLM outputs. Could the LLM used for estimating probability distributions learn to favor responses that resemble flawed human references, even if they are less informative, potentially leading to inaccurate scores? 2. Performance stability according to the choice of evaluation-LM and preprocessing LLM: How sensitive is GEM to the c

Reviewer 03·Rating 6·Confidence 4

Strengths

- The paper is well-written, and the problem statement is well-motivated. The authors provide cogent and coherent arguments throughout the paper. I personally think this is a very valid issue for LLM-based evaluators, as we do not always have a single human-verified gold-standard reference, but rather a set of good-enough responses to aggregate and use as a ground truth. The paper also has good mathematical rigor to prove its hypothesis.
- The benchmark created in the dataset is a plus point as it can

Weaknesses

- I do not see any glaring errors in the paper; however, I am suspicious about the experimental setup as such. Bottom line, the problem statement discusses a **very challenging** problem: _LLMs generating "judicious" reviews for very long and complex scientific papers_. I am not sure whether most of the existing LLMs are good enough for such a humongous task and whether they provide good results that can be analyzed with some guarantee, as the LLMs are required to have a decent knowledge of the doma

Code & Models

Repositories

yx-lu/benchmarking-llms--judgments-with-no-gold-standard
none·Official

Videos

Benchmarking LLMs' Judgments with No Gold Standard· slideslive

Taxonomy

Topics·Taxation and Legal Issues