Topic Segmentation of Research Article Collections
Erion Çano, Benjamin Roth

TL;DR
This paper introduces a large, structured dataset of approximately seven million research article records, segmented by topic, to support research tasks that require topically organized collections.
Contribution
It presents a large-scale, multitopic dataset of research articles, together with a taxonomy of topics extracted from the data records and a topic annotation for each document.
Findings
Created a dataset of ~7 million research articles with topic annotations
Developed a taxonomy of research topics from the dataset
Enabled use of the dataset as heterogeneous or homogeneous collections
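As a rough illustration of the two usage modes listed above, the sketch below treats a topic-annotated collection either as one mixed pool or as per-topic subsets. It is a minimal sketch under stated assumptions: the field names `title` and `topic`, the example topic labels, and the helper functions are hypothetical and do not reflect the dataset's actual schema or taxonomy. "Homogeneous" is read here as "all records share one topic".

```python
from collections import defaultdict

# Hypothetical records: the field names and topic labels are illustrative
# assumptions, not the real schema or taxonomy of the dataset.
RECORDS = [
    {"title": "Paper A", "topic": "Computer Science"},
    {"title": "Paper B", "topic": "Medicine"},
    {"title": "Paper C", "topic": "Computer Science"},
]


def group_by_topic(records):
    """Split a heterogeneous collection into one list of records per topic."""
    groups = defaultdict(list)
    for record in records:
        groups[record["topic"]].append(record)
    return dict(groups)


def homogeneous_subset(records, topic):
    """Keep only the records annotated with the given topic."""
    return [record for record in records if record["topic"] == topic]


# Heterogeneous use: work on RECORDS as a whole, topics mixed together.
# Homogeneous use: pick a single-topic subset.
by_topic = group_by_topic(RECORDS)
cs_only = homogeneous_subset(RECORDS, "Computer Science")
```

With seven million records, the same grouping would more likely be done in a streaming or on-disk fashion. The in-memory lists here are only for readability.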
Abstract
Collections of research article data harvested from the web have become common recently since they are important resources for experimenting on tasks such as named entity recognition, text summarization, or keyword generation. In fact, certain types of experiments require collections that are both large and topically structured, with records assigned to separate research disciplines. Unfortunately, the current collections of publicly available research articles are either small or heterogeneous and unstructured. In this work, we perform topic segmentation of a paper data collection that we crawled and produce a multitopic dataset of roughly seven million paper data records. We construct a taxonomy of topics extracted from the data records and then annotate each document with its corresponding topic from that taxonomy. As a result, it is possible to use this newly proposed dataset in two…
Taxonomy
Topics: Topic Modeling · Web Data Mining and Analysis · Advanced Text Analysis Techniques
