Consistency Evaluation of News Article Summaries Generated by Large (and Small) Language Models
Colleen Gilhuly, Haleh Shahzad

TL;DR
This paper evaluates the consistency of news article summaries generated by various large and small language models, introducing a meta evaluation score and finding that models often produce more consistent summaries than reference texts.
Contribution
It introduces a novel meta evaluation score for assessing LLM-based summary consistency and compares multiple summarization techniques across datasets.
Findings
All evaluated models produce summaries that are more consistent with the source article than the reference summaries are.
LLMs can effectively evaluate summary consistency.
Traditional metrics are complemented by LLM-powered assessments.
Abstract
Text summarization is a critical Natural Language Processing (NLP) task with applications ranging from information retrieval to content generation. Large Language Models (LLMs) have shown remarkable promise in generating fluent abstractive summaries, but they can produce hallucinated details not grounded in the source text. Regardless of how a summary is generated, high-quality automated evaluation remains an open area of investigation. This paper explores text summarization with a diverse set of techniques, including TextRank, BART, Mistral-7B-Instruct, and OpenAI GPT-3.5-Turbo. The generated summaries are evaluated using traditional metrics such as the Recall-Oriented Understudy for Gisting Evaluation (ROUGE) Score and Bidirectional Encoder Representations from Transformers (BERT) Score, as well as LLM-powered evaluation methods that directly assess a…
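To make the "traditional metrics" mentioned in the abstract concrete, the following is a minimal sketch of a ROUGE-1 style score (unigram overlap between a candidate summary and a reference summary). It is an illustration only, not the paper's evaluation code: the function names `tokenize` and `rouge1`, the lowercase alphanumeric tokenization, and the absence of stemming are all assumptions made here, and published ROUGE implementations may differ in these details.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumed tokenization: lowercase runs of letters and digits, no stemming.
    return re.findall(r"[a-z0-9]+", text.lower())


def rouge1(candidate, reference):
    """Unigram-overlap precision, recall, and F1 (a simplified ROUGE-1)."""
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))

    # Clipped overlap: each token counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand_counts & ref_counts).values())

    precision = overlap / sum(cand_counts.values()) if cand_counts else 0.0
    recall = overlap / sum(ref_counts.values()) if ref_counts else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    scores = rouge1("the cat sat", "the cat sat on the mat")
    print(scores)
```

Note that a score of this kind compares a generated summary only against the reference summary, never against the source article, so it cannot by itself indicate whether a summary is consistent with the source.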
