LDKP: A Dataset for Identifying Keyphrases from Long Scientific Documents
Debanjan Mahata, Navneet Agarwal, Dibya Gautam, Amardeep Kumar, Swapnil Parekh, Yaman Kumar Singla, Anish Acharya, Rajiv Ratn Shah

TL;DR
This paper introduces LDKP, a large dataset of approximately 1.4 million scientific articles with keyphrases and full texts, addressing the limitations of existing datasets that focus only on titles and abstracts for keyphrase extraction.
Contribution
The authors release two extensive corpora with full texts and metadata, enabling research on keyphrase extraction from long, real-world scientific documents beyond summaries.
Findings
Provides a large-scale dataset for keyphrase extraction from full texts.
Facilitates research on real-world, long scientific documents.
Addresses limitations of existing datasets focused on titles and abstracts.
Abstract
Identifying keyphrases (KPs) from text documents is a fundamental task in natural language processing and information retrieval. The vast majority of benchmark datasets for this task come from the scientific domain and contain only the document title and abstract. This limits keyphrase extraction (KPE) and keyphrase generation (KPG) algorithms to identifying keyphrases from human-written summaries that are often very short (approximately 8 sentences). This presents three challenges for real-world applications: human-written summaries are unavailable for most documents, the documents are almost always long, and a high percentage of KPs are found directly beyond the limited context of the title and abstract. Therefore, we release two extensive corpora mapping KPs of ~1.3M and ~100K scientific articles with their fully extracted text and additional metadata including publication venue, year,…
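The claim that many KPs appear only outside the title and abstract can be checked per document with a simple coverage count. The following is a minimal sketch, not the authors' procedure: it assumes exact token-sequence matching after lowercasing (no stemming), and the field names `title`, `abstract`, `body`, and `keyphrases`, as well as the toy document, are hypothetical and not taken from the released corpora.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and reduce the text to alphanumeric tokens joined by single spaces."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def is_present(keyphrase: str, text: str) -> bool:
    """True if the normalized keyphrase occurs as a whole-token sequence in the text."""
    kp = normalize(keyphrase)
    return bool(kp) and f" {kp} " in f" {normalize(text)} "


def coverage(keyphrases, short_context, full_text):
    """Split keyphrases by the smallest context in which they can be found."""
    in_short = [kp for kp in keyphrases if is_present(kp, short_context)]
    full_only = [
        kp for kp in keyphrases
        if kp not in in_short and is_present(kp, full_text)
    ]
    absent = [kp for kp in keyphrases if kp not in in_short and kp not in full_only]
    return {"short": in_short, "full_only": full_only, "absent": absent}


# Toy document invented for illustration; field names are assumptions.
doc = {
    "title": "Keyphrase Extraction from Long Documents",
    "abstract": "We study keyphrase extraction on scientific articles.",
    "body": "Our model uses a sparse attention mechanism over full text.",
    "keyphrases": ["keyphrase extraction", "sparse attention", "graph ranking"],
}

short = doc["title"] + " " + doc["abstract"]
result = coverage(doc["keyphrases"], short, short + " " + doc["body"])
print(result)
```

Averaging the size of the `full_only` bucket over a corpus would give a rough estimate of how many KPs a title-and-abstract-only model cannot extract; a stemmed or fuzzy matcher would likely shift the numbers.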
Taxonomy
Topics: Advanced Text Analysis Techniques
