News Summarization and Evaluation in the Era of GPT-3

Tanya Goyal; Junyi Jessy Li; Greg Durrett

arXiv:2209.12356·cs.CL·May 25, 2023·182 cites

TL;DR

This paper examines GPT-3's effectiveness in news summarization. It shows that humans prefer GPT-3 summaries over those of fine-tuned models and that existing automatic metrics cannot reliably evaluate them, and it also explores keyword-based summarization.

Contribution

It provides a comprehensive analysis of GPT-3's summarization capabilities, compares prompting to fine-tuning, and introduces new datasets and human judgments for evaluation.

Findings

01

Humans prefer GPT-3 summaries over summaries produced by fine-tuned models.

02

Standard automatic metrics are unreliable for evaluating GPT-3 summaries.

03

Prompting with GPT-3 performs well in both generic and keyword-based summarization.

Abstract

The recent success of prompting large language models like GPT-3 has led to a paradigm shift in NLP research. In this paper, we study its impact on text summarization, focusing on the classic benchmark domain of news summarization. First, we investigate how GPT-3 compares against fine-tuned models trained on large summarization datasets. We show that not only do humans overwhelmingly prefer GPT-3 summaries, prompted using only a task description, but these also do not suffer from common dataset-specific issues such as poor factuality. Next, we study what this means for evaluation, particularly the role of gold standard test sets. Our experiments show that both reference-based and reference-free automatic metrics cannot reliably evaluate GPT-3 summaries. Finally, we evaluate models on a setting beyond generic summarization, specifically keyword-based summarization, and show how dominant…
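To make the term "reference-based metric" concrete, here is a minimal sketch of a unigram-overlap F1 score, in the spirit of ROUGE-1. It is an illustration only, not the paper's evaluation code: the function names `tokenize` and `unigram_overlap_f1`, the simple regex tokenizer, and the example sentences are all assumptions made for this sketch.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase alphanumeric tokens; real ROUGE tooling may
    # tokenize and stem differently.
    return re.findall(r"[a-z0-9]+", text.lower())


def unigram_overlap_f1(candidate, reference):
    """F1 over clipped unigram matches between a candidate and a reference."""
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count per token (clipped matches).
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical example sentences, invented for illustration.
reference = "the council approved the new budget on monday"
extractive = "the council approved the budget"
paraphrase = "city lawmakers signed off on next year's spending plan"

score_extractive = unigram_overlap_f1(extractive, reference)
score_paraphrase = unigram_overlap_f1(paraphrase, reference)
print(f"extractive: {score_extractive:.3f}")
print(f"paraphrase: {score_paraphrase:.3f}")
```

In this toy case the paraphrase scores far lower than the near-copy even though both convey similar content, because the score depends entirely on word overlap with one reference text. That dependence is one reason such scores can diverge from human preference.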

Code & Models

Repositories

tagoyal/factuality-datasets (PyTorch, official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Cosine Annealing · Layer Normalization · Byte Pair Encoding · Linear Warmup With Cosine Annealing · Softmax