Generative Judge for Evaluating Alignment

Junlong Li; Shichao Sun; Weizhe Yuan; Run-Ze Fan; Hai Zhao; Pengfei Liu

arXiv:2310.05470·cs.CL·December 8, 2023·5 cites

TL;DR

This paper introduces Auto-J, a 13B generative judge trained on diverse real-world scenarios to evaluate LLM alignment across various protocols, outperforming existing models in multiple benchmarks.

Contribution

The paper presents a novel generative judge model, Auto-J, capable of flexible, interpretable, and general evaluation of LLMs, addressing limitations of traditional evaluation methods.

Findings

01 Auto-J outperforms strong competitors in diverse evaluation scenarios.

02 The model effectively handles multiple evaluation protocols with natural language critiques.

03 Extensive analysis demonstrates Auto-J's potential for improving LLM alignment assessment.

Abstract

The rapid development of Large Language Models (LLMs) has substantially expanded the range of tasks they can address. In the field of Natural Language Processing (NLP), researchers have shifted their focus from conventional NLP tasks (e.g., sequence tagging and parsing) towards tasks that revolve around aligning with human needs (e.g., brainstorming and email writing). This shift in task distribution imposes new requirements on evaluating these aligned models regarding generality (i.e., assessing performance across diverse scenarios), flexibility (i.e., examining under different protocols), and interpretability (i.e., scrutinizing models with explanations). In this paper, we propose a generative judge with 13B parameters, Auto-J, designed to address these challenges. Our model is trained on user queries and LLM-generated responses under massive real-world scenarios and accommodates…
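The flexibility requirement described above (one judge serving several evaluation protocols) can be illustrated with a minimal sketch. It assumes two protocols, pairwise comparison and single-response evaluation. The `JudgeRequest` type, the `build_judge_prompt` helper, and the prompt wording are hypothetical names introduced for illustration; they are not Auto-J's actual templates or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JudgeRequest:
    """Hypothetical container for one evaluation request."""
    query: str
    response_a: str
    response_b: Optional[str] = None  # None selects the single-response protocol


def build_judge_prompt(req: JudgeRequest) -> str:
    """Render an illustrative judge prompt (assumed wording, not Auto-J's template)."""
    if req.response_b is None:
        # Single-response protocol: critique one answer and rate it.
        return (
            "Protocol: single-response evaluation\n"
            f"Query: {req.query}\n"
            f"Response: {req.response_a}\n"
            "Write a critique, then give an overall rating."
        )
    # Pairwise protocol: critique both answers and pick the better one.
    return (
        "Protocol: pairwise comparison\n"
        f"Query: {req.query}\n"
        f"Response A: {req.response_a}\n"
        f"Response B: {req.response_b}\n"
        "Write a critique, then state which response is better or declare a tie."
    )
```

The same request type covers both protocols, so a caller only decides whether to supply a second response; the rendered prompt would then be passed to whatever generative judge is in use.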

Peer Reviews

Decision·ICLR 2024 poster

Reviewer 01 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

1) The design of scenario-specific criteria is strongly motivated, which will enable LLM-based judges to produce high-quality evaluations and critiques. Curated criteria can be model-agnostic and adapted to multiple models.
2) Comprehensive evaluation and analysis of Auto-J demonstrate that its evaluations are consistent and align well with human judgements.

Weaknesses

The technical contribution is somewhat limited, as it remains within the scope of training one more LLM as a judge to evaluate other LLMs' generations. Since the training data is obtained from GPT-4's output, it is unclear whether Auto-J can replace GPT-4 as a judge or generalize as strongly as GPT-4.

Reviewer 02 · Rating 5 (marginally below the acceptance threshold) · Confidence 3

Strengths

- Auto-J proposes a way to produce evaluation methods for LLMs

Weaknesses

- It is strange that larger models are used to evaluate other models - LLMs should somehow emulate human capabilities and not other LLMs' capabilities.

Reviewer 03 · Rating 6 (marginally above the acceptance threshold) · Confidence 2

Strengths

The paper provides an open-source model that can automatically judge a model's generated output; this could enable more researchers to run automatic evaluation at lower cost and with higher reliability.

Weaknesses

- The paper is a large engineering effort (e.g., distilling GPT-4 for the task of evaluation) without many novel ideas (I do not think that a paper needs to be novel to be accepted, but this paper does score low in terms of novelty).
- The presentation of the method and contribution feels very confusing to me (maybe it's just my fault). See questions below. I do not know whether other reviewers would have similar concerns though.

Code & Models

Repositories

gair-nlp/auto-j
pytorch · Official

Models

🤗 pulze/intent-v0.1
model · ♡ 3

Datasets

pulze/intent-v0.1-dataset
dataset · 10 dl

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)

Methods: Focus