Interactive Distillation of Large Single-Topic Corpora of Scientific Papers

Nicholas Solovyev; Ryan Barron; Manish Bhattarai; Maksim E. Eren; Kim Ø. Rasmussen; Boian S. Alexandrov

arXiv:2309.10772 · cs.IR · September 20, 2023

Open Access

TL;DR

This paper introduces a machine learning-based tool that constructively builds targeted scientific literature datasets by leveraging citation networks, text embeddings, and human-in-the-loop selection, improving scalability and accuracy.

Contribution

The paper presents a novel interactive method combining citation analysis, text embeddings, and sub-topic modeling for scalable, targeted literature dataset creation with human oversight.

Findings

01. Effective in two machine learning fields

02. Enables scalable, targeted dataset construction

03. Improves accuracy over purely reductive methods

Abstract

Highly specific datasets of scientific literature are important for both research and education, but they are difficult to build at scale. A common approach is to build these datasets reductively, by applying topic modeling to an established corpus and selecting specific topics. A more robust but time-consuming approach is to build the dataset constructively, with a subject matter expert (SME) handpicking documents. The constructive approach does not scale and becomes error-prone as the dataset grows. Here we showcase a new tool, based on machine learning, for constructively generating targeted datasets of scientific literature. Given a small initial "core" corpus of papers, we build a citation network of documents. At each step of the citation network, we generate text embeddings and visualize the embeddings through dimensionality reduction. Papers are kept in the dataset if they are…
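
The expansion loop described in the abstract can be pictured with a minimal sketch. It is not the authors' tool: the toy data, the bag-of-words `embed` function, the choice to follow outgoing citations, and every function name here are assumptions made for illustration. Because the abstract is truncated before it states the retention criterion, the decision is left to a caller-supplied `select` callback that stands in for the human reviewer; the dimensionality reduction and visualization step is omitted.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Set

# Hypothetical toy data: paper id -> ids it cites, and paper id -> abstract text.
CITATIONS: Dict[str, List[str]] = {"core1": ["a", "b"], "a": ["c"], "b": [], "c": []}
TEXTS: Dict[str, str] = {
    "core1": "tensor factorization topic model",
    "a": "nonnegative tensor factorization",
    "b": "protein folding simulation",
    "c": "topic model selection",
}


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; a stand-in for a real text embedding model."""
    return Counter(text.lower().split())


def distill(
    core: Iterable[str],
    citations: Dict[str, List[str]],
    texts: Dict[str, str],
    select: Callable[[Dict[str, Counter]], Iterable[str]],
    max_hops: int = 2,
) -> Set[str]:
    """Grow a dataset from a core corpus, one citation hop at a time."""
    kept: Set[str] = set(core)
    seen: Set[str] = set(kept)      # never re-offer a paper that was rejected
    frontier: Set[str] = set(kept)  # papers whose citations are followed next
    for _ in range(max_hops):
        candidates = {c for p in frontier for c in citations.get(p, [])} - seen
        if not candidates:
            break
        seen |= candidates
        embeddings = {p: embed(texts.get(p, "")) for p in candidates}
        # In an interactive tool, a reviewer would inspect the embeddings here.
        accepted = set(select(embeddings)) & candidates
        kept |= accepted
        frontier = accepted
    return kept


# Stand-in for the reviewer: accept papers sharing a word with the core paper.
core_vocab = set(embed(TEXTS["core1"]))


def overlaps_core(embeddings: Dict[str, Counter]) -> Set[str]:
    return {p for p, vec in embeddings.items() if core_vocab & set(vec)}


result = distill({"core1"}, CITATIONS, TEXTS, overlaps_core)
```

With this toy data, `result` contains `core1`, `a`, and `c`; paper `b` is rejected at the first hop and is never offered again. Expanding only from newly accepted papers keeps rejected branches of the citation network from pulling in further off-topic documents.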

Taxonomy

Topics: Topic Modeling · Scientific Computing and Data Management · Computational and Text Analysis Methods