Interactive Distillation of Large Single-Topic Corpora of Scientific Papers
Nicholas Solovyev, Ryan Barron, Manish Bhattarai, Maksim E. Eren, Kim Ø. Rasmussen, Boian S. Alexandrov

TL;DR
This paper introduces a machine learning-based tool that constructively builds targeted scientific literature datasets by leveraging citation networks, text embeddings, and human-in-the-loop selection, improving scalability and accuracy.
Contribution
The paper presents a novel interactive method combining citation analysis, text embeddings, and sub-topic modeling for scalable, targeted literature dataset creation with human oversight.
Findings
Effective in two machine learning fields
Enables scalable, targeted dataset construction
Improves accuracy over purely reductive methods
Abstract
Highly specific datasets of scientific literature are important for both research and education. However, it is difficult to build such datasets at scale. A common approach is to build these datasets reductively by applying topic modeling to an established corpus and selecting specific topics. A more robust but time-consuming approach is to build the dataset constructively, with a subject matter expert (SME) handpicking documents. This method does not scale and is prone to error as the dataset grows. Here we showcase a new tool, based on machine learning, for constructively generating targeted datasets of scientific literature. Given a small initial "core" corpus of papers, we build a citation network of documents. At each step of the citation network, we generate text embeddings and visualize the embeddings through dimensionality reduction. Papers are kept in the dataset if they are…
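The expansion loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' tool: the function names (`embed`, `cosine`, `expand_once`), the bag-of-words "embedding", and the similarity threshold in the usage lines are all assumptions made for the example. The abstract is cut off before it states the actual retention criterion, so the sketch leaves that decision to a caller-supplied `select` callback, which stands in for the human reviewer. The dimensionality-reduction and visualization step is omitted.

```python
from collections import Counter
import math


def embed(text):
    """Toy stand-in for a text embedding: lowercase bag-of-words counts (assumption)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def expand_once(corpus, citations, abstracts, select):
    """Grow `corpus` by one citation hop.

    corpus:    set of paper ids already accepted
    citations: dict mapping paper id -> iterable of linked paper ids
    abstracts: dict mapping paper id -> text
    select:    callback(paper_id, similarity) -> bool; hypothetical stand-in
               for the reviewer's keep/discard decision
    """
    # Candidates are papers linked to the corpus but not yet part of it.
    candidates = {n for p in corpus for n in citations.get(p, ()) if n not in corpus}
    # Summarise the current corpus as one summed term vector.
    centroid = Counter()
    for p in corpus:
        centroid.update(embed(abstracts[p]))
    kept = set()
    for c in sorted(candidates):
        sim = cosine(embed(abstracts[c]), centroid)
        if select(c, sim):
            kept.add(c)
    return corpus | kept


# Toy data (invented for illustration only).
abstracts = {
    "A": "tensor factorization topic model",
    "B": "nonnegative tensor factorization rank",
    "C": "protein folding simulation",
}
citations = {"A": ["B", "C"]}

# The 0.3 threshold is an arbitrary placeholder for an interactive choice.
result = expand_once({"A"}, citations, abstracts, lambda pid, sim: sim > 0.3)
print(sorted(result))
```

Calling `expand_once` repeatedly on its own output would walk the citation network one hop at a time, with the reviewer's callback deciding what is retained at each step.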
Taxonomy
Topics: Topic Modeling · Scientific Computing and Data Management · Computational and Text Analysis Methods
