Faithful, Unfaithful or Ambiguous? Multi-Agent Debate with Initial Stance for Summary Evaluation
Mahnaz Koupaee, Jake W. Vincent, Saab Mansour, Igor Shalyminov, Han He, Hwanjun Song, Raphael Shu, Jianfeng He, Yi Nian, Amy Wing-mei Wong, Kyu J. Han, Hang Su

TL;DR
This paper introduces a multi-agent debate framework with initial stances for improved summary faithfulness evaluation, addressing issues of ambiguity and enhancing error detection in summaries.
Contribution
It proposes a novel multi-agent debate approach with initial stances to improve faithfulness evaluation and introduces a taxonomy for ambiguity in summaries.
Findings
Better identification of errors in summaries.
Enhanced detection of ambiguous summaries.
Stronger performance on non-ambiguous summaries.
Abstract
Faithfulness evaluators based on large language models (LLMs) are often fooled by the fluency of the text and struggle with identifying errors in the summaries. We propose an approach to summary faithfulness evaluation in which multiple LLM-based agents are assigned initial stances (regardless of what their belief might be) and forced to come up with a reason to justify the imposed belief, thus engaging in a multi-round debate to reach an agreement. The uniformly distributed initial assignments result in a greater diversity of stances leading to more meaningful debates and ultimately more errors identified. Furthermore, by analyzing the recent faithfulness evaluation datasets, we observe that naturally, it is not always the case for a summary to be either faithful to the source document or not. We therefore introduce a new dimension, ambiguity, and a detailed taxonomy to identify such…
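The debate loop described in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `assign_initial_stances`, `majority_follower`, and `debate` are hypothetical, the agent is a toy rule standing in for an LLM call, "uniformly distributed" is read here as an even split of stances across agents, and the round limit is arbitrary.

```python
import random
from collections import Counter
from typing import Callable, List, Optional

STANCES = ("faithful", "unfaithful")

# An agent maps (its current stance, all stances from the last round) to a new stance.
# In a real system this would be an LLM prompted to argue for its stance (assumption).
Agent = Callable[[str, List[str]], str]


def assign_initial_stances(n_agents: int, rng: random.Random) -> List[str]:
    """Impose stances split as evenly as possible, then shuffle who gets which."""
    stances = [STANCES[i % len(STANCES)] for i in range(n_agents)]
    rng.shuffle(stances)
    return stances


def majority_follower(own: str, all_stances: List[str]) -> str:
    """Toy stand-in for an LLM agent: adopt the majority view, keep own stance on a tie."""
    ranked = Counter(all_stances).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return own
    return ranked[0][0]


def debate(agents: List[Agent], stances: List[str], max_rounds: int = 3) -> Optional[str]:
    """Run rounds until all agents agree; return the agreed label, or None if they never do."""
    for _ in range(max_rounds):
        if len(set(stances)) == 1:
            break
        # Every agent sees the previous round's stances and may revise its own.
        stances = [agent(own, stances) for agent, own in zip(agents, stances)]
    return stances[0] if len(set(stances)) == 1 else None


if __name__ == "__main__":
    rng = random.Random(0)
    agents = [majority_follower] * 3
    print(debate(agents, assign_initial_stances(len(agents), rng)))
```

How a persistent disagreement should be labeled is left open in this sketch (it returns `None`); the truncated abstract does not say how such cases map onto the ambiguity dimension.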
Taxonomy
Topics: Complex Systems and Decision Making · Experimental Behavioral Economics Studies
