A Large-Scale Dataset for Biomedical Keyphrase Generation

Mael Houbre; Florian Boudin; Beatrice Daille

arXiv:2211.12124·cs.CL·November 23, 2022·1 cites

TL;DR

This paper introduces kp-biomed, a large-scale dataset with over 5 million biomedical documents for keyphrase generation, and demonstrates that larger datasets significantly enhance model performance.

Contribution

The paper provides the first large-scale biomedical keyphrase dataset and evaluates generative models, showing improved results with increased data size.

Findings

01 Training on larger datasets improves keyphrase generation performance.

02 The improvement holds for both present and absent keyphrases.

03 The public release of the dataset facilitates future research in biomedical NLP.

Abstract

Keyphrase generation is the task of generating a set of words or phrases that highlight the main topics of a document. Few datasets exist for keyphrase generation in the biomedical domain, and those that do are too small for training generative models. In this paper, we introduce kp-biomed, the first large-scale biomedical keyphrase generation dataset, with more than 5M documents collected from PubMed abstracts. We train and release several generative models and conduct a series of experiments showing that using large-scale datasets significantly improves performance for both present and absent keyphrase generation. The dataset is available under the CC-BY-NC v4.0 license at https://huggingface.co/datasets/taln-ls2n/kpbiomed.
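The abstract distinguishes present from absent keyphrases without defining them. By the usual convention in keyphrase generation work, a keyphrase is present when it occurs as a contiguous word sequence in the source text and absent otherwise. The sketch below illustrates that split under simplifying assumptions: it matches on lowercased tokens only, whereas evaluations commonly apply stemming first, and the function names and sample text are hypothetical, not taken from the paper or its repository.

```python
import re


def tokenize(text):
    """Lowercase and split into alphanumeric tokens (no stemming: a simplification)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_sequence(haystack, needle):
    """Return True if the token list `needle` occurs contiguously in `haystack`."""
    n = len(needle)
    if n == 0:
        return False
    return any(haystack[i:i + n] == needle for i in range(len(haystack) - n + 1))


def split_present_absent(document, keyphrases):
    """Partition keyphrases into those found in the document and those not found."""
    doc_tokens = tokenize(document)
    present, absent = [], []
    for keyphrase in keyphrases:
        if contains_sequence(doc_tokens, tokenize(keyphrase)):
            present.append(keyphrase)
        else:
            absent.append(keyphrase)
    return present, absent


# Hypothetical example text and keyphrases, for illustration only.
document = "Keyphrase generation models are trained on PubMed abstracts."
keyphrases = ["keyphrase generation", "PubMed abstracts", "biomedical NLP"]
present, absent = split_present_absent(document, keyphrases)
```

Here `present` holds the two phrases that occur in the text, and `absent` holds "biomedical NLP", which a model would have to generate rather than copy from the input.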

Code & Models

Repositories

mhoubre/kpbiomed
Official

Datasets

taln-ls2n/kpbiomed
dataset · 43 downloads

Taxonomy

Topics: Advanced Text Analysis Techniques