TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization
Liyan Tang, Igor Shalyminov, Amy Wing-mei Wong, Jon Burnsky, Jake W. Vincent, Yu'an Yang, Siffi Singh, Song Feng, Hwanjun Song, Hang Su, Lijia Sun, Yi Zhang, Saab Mansour, Kathleen McKeown

TL;DR
This paper introduces TofuEval, a benchmark for evaluating hallucinations in LLM-generated topic-focused dialogue summaries. It finds that LLMs of all sizes make substantial factual errors, and that LLM-based evaluators, including GPT-4, are outperformed by specialized factuality metrics.
Contribution
It presents a new benchmark with human annotations for factual consistency in dialogue summarization and analyzes the performance of LLMs and metrics in detecting hallucinations.
Findings
LLMs hallucinate many factual errors in dialogue summaries
GPT-4 as an evaluator performs poorly compared to specialized metrics
Non-LLM metrics better capture diverse hallucination types
Abstract
Single document news summarization has seen substantial progress on faithfulness in recent years, driven by research on the evaluation of factual consistency, or hallucinations. We ask whether these advances carry over to other text summarization domains. We propose a new evaluation benchmark on topic-focused dialogue summarization, generated by LLMs of varying sizes. We provide binary sentence-level human annotations of the factual consistency of these summaries along with detailed explanations of factually inconsistent sentences. Our analysis shows that existing LLMs hallucinate significant amounts of factual errors in the dialogue domain, regardless of the model's size. On the other hand, when LLMs, including GPT-4, serve as binary factual evaluators, they perform poorly and can be outperformed by prevailing state-of-the-art specialized factuality evaluation metrics. Finally, we…
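To make the evaluation setup concrete, the following is a minimal sketch of how a binary factuality evaluator could be scored against sentence-level human labels. It is not the authors' code. The class and function names, the example sentences, the rule that a summary is consistent only if every sentence is, and the choice of balanced accuracy as the score are all assumptions made for illustration; the abstract does not state which metric is used.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SentenceJudgment:
    """One summary sentence with a human label and an evaluator prediction.

    All names here are hypothetical; True means "factually consistent".
    """
    sentence: str
    human_consistent: bool
    evaluator_consistent: bool


def balanced_accuracy(judgments: List[SentenceJudgment]) -> float:
    """Mean of the recall on consistent and on inconsistent sentences."""
    pos = sum(j.human_consistent for j in judgments)
    neg = len(judgments) - pos
    if pos == 0 or neg == 0:
        raise ValueError("need at least one sentence of each class")
    tp = sum(j.human_consistent and j.evaluator_consistent for j in judgments)
    tn = sum(
        (not j.human_consistent) and (not j.evaluator_consistent)
        for j in judgments
    )
    return 0.5 * (tp / pos + tn / neg)


def summary_is_consistent(judgments: List[SentenceJudgment]) -> bool:
    """Assumed aggregation: a summary is consistent only if every sentence is."""
    return all(j.human_consistent for j in judgments)


# Made-up example data, not taken from the benchmark.
example = [
    SentenceJudgment("The caller asked about a refund.", True, True),
    SentenceJudgment("The agent confirmed the order number.", True, True),
    SentenceJudgment("The refund was issued the same day.", True, False),
    SentenceJudgment("The caller cancelled the account.", False, False),
]

print(f"balanced accuracy: {balanced_accuracy(example):.3f}")
print(f"summary consistent: {summary_is_consistent(example)}")
```

Balanced accuracy is used in this sketch because consistent sentences typically outnumber inconsistent ones, so plain accuracy would reward an evaluator that labels everything as consistent.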
Taxonomy
Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Machine Learning in Healthcare
Methods: Linear Layer · Dense Connections · Label Smoothing · Adam · Attention Is All You Need · Softmax · Multi-Head Attention · Layer Normalization · Residual Connection · Absolute Position Encodings
