Benchmarking Large Language Models for News Summarization

Tianyi Zhang; Faisal Ladhak; Esin Durmus; Percy Liang; Kathleen McKeown; Tatsunori B. Hashimoto

arXiv:2301.13848·cs.CL·February 1, 2023·64 cites

TL;DR

This paper evaluates large language models for news summarization, finding that instruction tuning, not model size, drives zero-shot summarization performance, and that high-quality human-written references are crucial for accurate assessment.

Contribution

It demonstrates that instruction tuning is key to LLM summarization ability and emphasizes the importance of high-quality references for evaluation.

Findings

01. Instruction tuning, not model size, improves zero-shot summarization.

02. High-quality human references lead to more accurate evaluation.

03. LLM summaries are judged comparable to human summaries despite stylistic differences.

Abstract

Large language models (LLMs) have shown promise for automatic summarization, but the reasons behind their successes are poorly understood. By conducting a human evaluation on ten LLMs across different pretraining methods, prompts, and model scales, we make two important observations. First, we find instruction tuning, and not model size, is the key to the LLM's zero-shot summarization capability. Second, existing studies have been limited by low-quality references, leading to underestimates of human performance and lower few-shot and finetuning performance. To better evaluate LLMs, we perform human evaluation over high-quality summaries we collect from freelance writers. Despite major stylistic differences such as the amount of paraphrasing, we find that LLM summaries are judged to be on par with human-written summaries.

Code & Models

Repositories

tiiiger/benchmark_llm_summarization (official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques