ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate

Chi-Min Chan; Weize Chen; Yusheng Su; Jianxuan Yu; Wei Xue; Shanghang Zhang; Jie Fu; Zhiyuan Liu

arXiv:2308.07201·cs.CL·August 15, 2023·54 cites



TL;DR

ChatEval introduces a multi-agent debate framework where multiple LLMs collaboratively evaluate responses, improving the reliability and human-like quality of assessments for open-ended and NLG tasks.

Contribution

This work pioneers a multi-agent debate approach with LLMs acting as collaborative evaluators, surpassing single-agent methods in response quality assessment.

Findings

01. ChatEval achieves more reliable evaluations than single-agent models.
02. Multi-agent debate enhances assessment accuracy for complex NLG tasks.
03. The approach mimics human evaluation processes more closely.

Abstract

Text evaluation has historically posed significant challenges, often demanding substantial labor and time cost. With the emergence of large language models (LLMs), researchers have explored LLMs' potential as alternatives for human evaluation. While these single-agent-based approaches show promise, experimental results suggest that further advancements are needed to bridge the gap between their current effectiveness and human-level evaluation quality. Recognizing that best practices of human evaluation processes often involve multiple human annotators collaborating in the evaluation, we resort to a multi-agent debate framework, moving beyond single-agent prompting strategies. The multi-agent-based approach enables a group of LLMs to synergize with an array of intelligent counterparts, harnessing their distinct capabilities and expertise to enhance efficiency and effectiveness in…
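The debate idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Agent` class, the `debate` function, the role names, the turn-taking order, and the majority-vote aggregation are all assumptions made for illustration, and the rule-based judges stand in for what would be LLM calls in practice.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Tuple

# A judge maps (question, answer_a, answer_b, history) to (comment, vote),
# where vote is "A", "B", or "tie". In a real system this would wrap an LLM call.
JudgeFn = Callable[[str, str, str, List[str]], Tuple[str, str]]


@dataclass
class Agent:
    role: str       # hypothetical persona label
    judge: JudgeFn


def debate(agents, question, answer_a, answer_b, rounds=2):
    """Agents speak in turn; each sees the shared history before voting."""
    history: List[str] = []
    votes = {}
    for _ in range(rounds):
        for agent in agents:
            comment, vote = agent.judge(question, answer_a, answer_b, list(history))
            history.append(f"[{agent.role}] {comment} (vote: {vote})")
            votes[agent.role] = vote  # keep only each agent's latest vote
    # Assumed aggregation: majority vote over final votes, "tie" if the top two are equal.
    ranked = Counter(votes.values()).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return "tie", history
    return ranked[0][0], history


# Rule-based stand-ins for LLM judges (purely illustrative).
def prefers_longer(question, a, b, history):
    return "The longer answer gives more detail.", "A" if len(a) > len(b) else "B"


def prefers_shorter(question, a, b, history):
    return "The shorter answer is more direct.", "A" if len(a) < len(b) else "B"


def follows_last_speaker(question, a, b, history):
    if not history:
        return "No prior opinions to weigh.", "tie"
    last_vote = history[-1].rsplit("vote: ", 1)[1].rstrip(")")
    return "I agree with the previous speaker.", last_vote


agents = [
    Agent("detail", prefers_longer),
    Agent("concise", prefers_shorter),
    Agent("follower", follows_last_speaker),
]
verdict, transcript = debate(agents, "What is 2+2?", "4", "The answer is four.")
print(verdict)
for line in transcript:
    print(line)
```

The shared `history` list is what distinguishes this from independently polling several judges: later speakers can react to earlier arguments, which the `follower` stub mimics in the crudest possible way.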

Peer Reviews

Decision · ICLR 2024 poster

Reviewer 01 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

1. The idea of using LLMs for automatic text evaluation is intriguing and holds potential.
2. The paper is well-structured and easy to follow. The inclusion of informative figures and tables enhances clarity.

Weaknesses

1. The paper may benefit from providing more in-depth technical details about the ChatEval framework. While a prompt template is present in the appendix, how the LLM roles are determined and how varying the roles and their order affects results remain unclear. In addition, the framework relies on prompting LLMs; however, the paper lacks sufficient information on the design of the prompts and their robustness.
2. The evaluation results present a challenge in assessing whether the multi-agent debate framewor

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 4

Strengths

1. The paper addresses the important problem of evaluating generated text with LLMs, aiming to reduce the shortcomings of traditional human-based evaluation.
2. Results demonstrate the framework's effectiveness on the FairEval and Topical-Chat datasets by comparing it against various other frameworks on these datasets.

Weaknesses

1. In the abstract, the authors mention that one of the drawbacks of using humans in the pipeline is the time and labor cost. The paper would benefit significantly from an analysis of the time humans spend on annotation versus the time the LLM takes.
2. Based on the analysis in Section 4.3, the framework seems to be suited only to short conversations. This would be an issue when extending the framework to evaluate longer dialogues.

Reviewer 03 · Rating 5 (marginally below the acceptance threshold) · Confidence 4

Strengths

1. The authors carefully study the different components of the multi-agent system, including the communication pattern, agent role design, number of agents, and number of discussion rounds. The additional experiments on individual components and the detailed discussion are very helpful for understanding the design choices and can inform future research.
2. Results on two benchmarks show improvements from using multiple agents with both ChatGPT and GPT-4.

Weaknesses

1. The experiments are conducted on only two datasets, both relatively small. On Topical-Chat, the wins and losses between single-agent and multi-agent are mixed across rubrics, and ChatGPT and GPT-4 seem to behave differently. More analysis of each rubric would be helpful, rather than simply comparing the average improvement.
2. The experiments are conducted only with ChatGPT and GPT-4, so it is unclear whether the proposed method generalizes to other LLMs or is limited to GPT models. Evalua


Taxonomy

Topics: Topic Modeling · Multi-Agent Systems and Negotiation