KPTimes: A Large-Scale Dataset for Keyphrase Generation on News Documents

Ygor Gallina; Florian Boudin; Béatrice Daille

arXiv:1911.12559·cs.IR·December 2, 2019


TL;DR

KPTimes is a large-scale news dataset with editor-curated keyphrases, enabling better training and evaluation of keyphrase generation models in the news domain.

Contribution

The paper introduces KPTimes, a novel large-scale dataset for news keyphrase generation, and analyzes editor annotation patterns compared to existing datasets.

Findings

01. State-of-the-art models perform variably on news data
02. Editors' tagging differs from scholarly datasets
03. KPTimes enhances model training for news keyphrases

Abstract

Keyphrase generation is the task of predicting a set of lexical units that conveys the main content of a source text. Existing datasets for keyphrase generation are only readily available for the scholarly domain and include non-expert annotations. In this paper we present KPTimes, a large-scale dataset of news texts paired with editor-curated keyphrases. Exploring the dataset, we show how editors tag documents, and how their annotations differ from those found in existing datasets. We also train and evaluate state-of-the-art neural keyphrase generation models on KPTimes to gain insights on how well they perform on the news domain. The dataset is available online at https://github.com/ygorg/KPTimes.
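To make the task concrete, here is a minimal sketch of how a predicted set of keyphrases might be scored against editor-assigned ones. The abstract does not say which metric the authors use; exact-match F1 over the top k predictions is assumed here because it is a common choice in keyphrase evaluation. The function name `f1_at_k`, the normalization (lowercasing and whitespace collapsing only, with no stemming), and the sample phrases are all illustrative and do not come from the paper or the dataset.

```python
def f1_at_k(predicted, gold, k=10):
    """Exact-match F1 between the top-k predictions and the gold keyphrases.

    Assumption: phrases match after lowercasing and collapsing whitespace.
    Published evaluations often also apply stemming, which is omitted here.
    """
    def norm(phrase):
        return " ".join(phrase.lower().split())

    # Deduplicate predictions while keeping their ranking, then cut at k.
    top_k = []
    for phrase in predicted:
        n = norm(phrase)
        if n not in top_k:
            top_k.append(n)
    top_k = top_k[:k]

    gold_set = {norm(g) for g in gold}
    if not top_k or not gold_set:
        return 0.0

    correct = sum(1 for p in top_k if p in gold_set)
    if correct == 0:
        return 0.0

    precision = correct / len(top_k)
    recall = correct / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Made-up example: two of four predictions match two of three gold keyphrases.
predicted = ["Climate Change", "united nations", "carbon tax", "paris"]
gold = ["climate change", "United Nations", "global warming"]
score = f1_at_k(predicted, gold, k=10)
```

With precision 2/4 and recall 2/3, the score in this example works out to 4/7. Averaging such per-document scores over a test split gives a single number for comparing models.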

Code & Models

Repositories

ygorg/KPTimes (official)
