TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization

Liyan Tang; Igor Shalyminov; Amy Wing-mei Wong; Jon Burnsky; Jake W. Vincent; Yu'an Yang; Siffi Singh; Song Feng; Hwanjun Song; Hang Su; Lijia Sun; Yi Zhang; Saab Mansour; Kathleen McKeown

arXiv:2402.13249 · cs.CL · April 2, 2024 · 1 citation

TL;DR

This paper introduces TofuEval, a benchmark for evaluating hallucinations in topic-focused dialogue summarization by LLMs, revealing significant factual errors and limitations of LLM evaluators compared to specialized metrics.

Contribution

It presents a new benchmark with human annotations for factual consistency in dialogue summarization and analyzes the performance of LLMs and metrics in detecting hallucinations.

Findings

01. LLMs of all sizes introduce many factual errors into dialogue summaries

02. LLM evaluators, including GPT-4, perform poorly as binary factual evaluators and can be outperformed by specialized factuality metrics

03. Non-LLM metrics better capture diverse hallucination types

Abstract

Single document news summarization has seen substantial progress on faithfulness in recent years, driven by research on the evaluation of factual consistency, or hallucinations. We ask whether these advances carry over to other text summarization domains. We propose a new evaluation benchmark on topic-focused dialogue summarization, generated by LLMs of varying sizes. We provide binary sentence-level human annotations of the factual consistency of these summaries along with detailed explanations of factually inconsistent sentences. Our analysis shows that existing LLMs hallucinate significant amounts of factual errors in the dialogue domain, regardless of the model's size. On the other hand, when LLMs, including GPT-4, serve as binary factual evaluators, they perform poorly and can be outperformed by prevailing state-of-the-art specialized factuality evaluation metrics. Finally, we…
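The evaluation setup described in the abstract, where an automatic evaluator assigns a binary consistent/inconsistent label to each summary sentence and is compared against human annotations, can be illustrated with a minimal sketch. Balanced accuracy is used here only as an assumed scoring choice, since this page does not state which metric the paper reports; the function name and the toy labels are hypothetical and are not data from the benchmark.

```python
from typing import Sequence


def balanced_accuracy(human: Sequence[int], predicted: Sequence[int]) -> float:
    """Average per-class recall for binary labels.

    Labels are assumed to be 1 (factually consistent) or 0 (inconsistent).
    Classes that never occur in the human labels are skipped.
    """
    if len(human) != len(predicted):
        raise ValueError("label sequences must have the same length")

    recalls = []
    for label in (0, 1):
        # Indices of sentences that humans assigned to this class.
        indices = [i for i, h in enumerate(human) if h == label]
        if not indices:
            continue
        correct = sum(1 for i in indices if predicted[i] == label)
        recalls.append(correct / len(indices))

    if not recalls:
        raise ValueError("no labeled sentences to score")
    return sum(recalls) / len(recalls)


# Hypothetical toy labels, one per summary sentence.
human_labels = [1, 1, 1, 0, 1, 0]
evaluator_labels = [1, 1, 1, 1, 1, 0]

score = balanced_accuracy(human_labels, evaluator_labels)
print(f"balanced accuracy: {score:.2f}")
```

Averaging the recall of each class keeps an evaluator that labels every sentence as consistent from scoring well when inconsistent sentences are the minority.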

Code & Models

Repositories

amazon-science/tofueval (Official)

Videos

TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization · Underline

Taxonomy

Topics: Topic Modeling · Biomedical Text Mining and Ontologies · Machine Learning in Healthcare

Methods: Linear Layer · Dense Connections · Label Smoothing · Adam · Attention Is All You Need · Softmax · Multi-Head Attention · Layer Normalization · Residual Connection · Absolute Position Encodings