TL;DR
KPTimes is a large-scale news dataset with editor-curated keyphrases, enabling better training and evaluation of keyphrase generation models in the news domain.
Contribution
The paper introduces KPTimes, a novel large-scale dataset for news keyphrase generation, and analyzes editor annotation patterns compared to existing datasets.
Findings
State-of-the-art models perform variably on news data
Editors' tagging differs from scholarly datasets
KPTimes enhances model training for news keyphrases
Abstract
Keyphrase generation is the task of predicting a set of lexical units that conveys the main content of a source text. Existing datasets for keyphrase generation are only readily available for the scholarly domain and include non-expert annotations. In this paper we present KPTimes, a large-scale dataset of news texts paired with editor-curated keyphrases. Exploring the dataset, we show how editors tag documents, and how their annotations differ from those found in existing datasets. We also train and evaluate state-of-the-art neural keyphrase generation models on KPTimes to gain insights on how well they perform on the news domain. The dataset is available online at https://github.com/ygorg/KPTimes.
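To make the evaluation step concrete, here is a minimal sketch of how predicted keyphrases might be scored against editor-curated ones. It assumes exact-match F1 over the top k predictions, a common choice in keyphrase generation work; the summary above does not state which metric the paper uses, and the function names and example phrases are hypothetical.

```python
def normalize(phrase: str) -> str:
    """Lowercase and collapse whitespace so trivially different strings match."""
    return " ".join(phrase.lower().split())


def f1_at_k(predicted: list[str], gold: list[str], k: int = 5) -> float:
    """Exact-match F1 between the top-k distinct predictions and the gold set.

    Assumption: this metric is illustrative only and is not taken from the paper.
    """
    # Keep the first k distinct predictions, preserving their ranking order.
    top_k: list[str] = []
    for phrase in predicted:
        norm = normalize(phrase)
        if norm not in top_k:
            top_k.append(norm)
        if len(top_k) == k:
            break

    gold_set = {normalize(g) for g in gold}
    if not top_k or not gold_set:
        return 0.0

    correct = sum(1 for phrase in top_k if phrase in gold_set)
    if correct == 0:
        return 0.0

    precision = correct / len(top_k)
    recall = correct / len(gold_set)
    return 2 * precision * recall / (precision + recall)


# Hypothetical example: editor keyphrases versus ranked model output.
gold = ["Elections", "Senate", "Campaign Finance"]
predicted = ["senate", "elections", "taxes", "Senate "]
score = f1_at_k(predicted, gold, k=3)
```

In this example two of the three top-ranked predictions match the three editor keyphrases, so precision and recall are both 2/3. Exact matching is strict: a prediction such as "campaign financing" would count as a miss, which is why many evaluations also apply stemming before comparison.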
