Generative Judge for Evaluating Alignment
Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, Pengfei Liu

TL;DR
This paper introduces Auto-J, a 13B generative judge trained on diverse real-world scenarios to evaluate LLM alignment across various protocols, outperforming existing models in multiple benchmarks.
Contribution
The paper presents a novel generative judge model, Auto-J, capable of flexible, interpretable, and general evaluation of LLMs, addressing limitations of traditional evaluation methods.
Findings
Auto-J outperforms strong competitors in diverse evaluation scenarios.
The model effectively handles multiple evaluation protocols with natural language critiques.
Extensive analysis demonstrates Auto-J's potential for improving LLM alignment assessment.
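As a rough illustration of how a judge's verdicts might be checked against human labels in a pairwise-comparison protocol, here is a minimal Python sketch. The verdict labels ("A", "B", "tie") and the function names are assumptions made for illustration and are not taken from the paper. The two metrics are a generic agreement rate and an order-swap consistency rate, not necessarily the exact metrics the authors report.

```python
from typing import Sequence

# Hypothetical verdict labels for a pairwise comparison: "A", "B", or "tie".
Verdict = str


def agreement_rate(judge: Sequence[Verdict], human: Sequence[Verdict]) -> float:
    """Fraction of samples where the judge's verdict matches the human label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def swap_verdict(verdict: Verdict) -> Verdict:
    """Map a verdict given on (B, A) order back to the (A, B) order."""
    return {"A": "B", "B": "A"}.get(verdict, verdict)


def consistency_rate(original: Sequence[Verdict], swapped: Sequence[Verdict]) -> float:
    """Fraction of samples where the judge picks the same winner
    after the two responses are presented in the opposite order."""
    if len(original) != len(swapped) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(o == swap_verdict(s) for o, s in zip(original, swapped)) / len(original)
```

A judge that always prefers whichever response appears first would score poorly on `consistency_rate` even if its `agreement_rate` looked acceptable, which is why the two are worth tracking separately.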
Abstract
The rapid development of Large Language Models (LLMs) has substantially expanded the range of tasks they can address. In the field of Natural Language Processing (NLP), researchers have shifted their focus from conventional NLP tasks (e.g., sequence tagging and parsing) towards tasks that revolve around aligning with human needs (e.g., brainstorming and email writing). This shift in task distribution imposes new requirements on evaluating these aligned models regarding generality (i.e., assessing performance across diverse scenarios), flexibility (i.e., examining under different protocols), and interpretability (i.e., scrutinizing models with explanations). In this paper, we propose a generative judge with 13B parameters, Auto-J, designed to address these challenges. Our model is trained on user queries and LLM-generated responses under massive real-world scenarios and accommodates…
Peer Reviews
Decision: ICLR 2024 poster
1). The design of scenario-specific criteria is strongly motivated, which will enable LLM-based judges to produce high-quality evaluations and critiques. Curated criteria can be model-agnostic and adapted to multiple models. 2). Comprehensive evaluation and analysis of Auto-J demonstrate that its evaluations are consistent and can align well with human judgements.
The technical contribution is a bit limited, as it is still within the scope of training one more LLM as a judge to evaluate other LLMs’ generations. Since the training data is obtained from GPT4’s output, it is unclear whether it can replace GPT4 as a judge or generalize as strongly as GPT4.
- Auto-J proposes a way to produce evaluation methods for LLMs
- It is strange that larger models are used to evaluate other models - LLMs should somehow emulate human capabilities and not other LLMs' capabilities.
The paper provides an open-sourced model that can automatically judge a model's generated output; this could potentially enable more researchers to run automatic evaluation at a lower cost with higher reliability.
- The paper is a large engineering effort (e.g., distilling GPT-4 for the task of evaluation) without many novel ideas (I do not think that a paper needs to be novel to be accepted, but this paper does score low in terms of novelty) - The presentation of the method and contribution feels very confusing to me (maybe it's just my fault). See questions below. I do not know whether other reviewers would have similar concerns though.
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Explainable Artificial Intelligence (XAI)
Methods: Focus
