AutoDataset: A Lightweight System for Continuous Dataset Discovery and Search

Junzhe Yang; Xinghao Chen; Yunuo Liu; Zhijing Sun; Wenjin Guo; Xiaoyu Shen

arXiv:2603.07271 · cs.IR · March 10, 2026

TL;DR

AutoDataset is an automated, real-time system that continuously monitors arXiv to discover, extract, and index new datasets from research papers, significantly improving dataset discovery speed and coverage.

Contribution

We introduce AutoDataset, a lightweight, automated pipeline for real-time dataset discovery from research papers, combining classification, PDF parsing, URL extraction, and semantic search.

Findings

01 Achieves an F1 score of 0.94 in dataset paper classification

02 Reduces dataset discovery time by up to 80%

03 Enables low-latency natural language search for datasets

Abstract

The continuous expansion of task-specific datasets has become a major driver of progress in machine learning. However, discovering newly released datasets remains difficult, as existing platforms largely depend on manual curation or community submissions, leading to limited coverage and substantial delays. To address this challenge, we introduce AutoDataset, a lightweight, automated system for real-time dataset discovery and retrieval. AutoDataset adopts a paper-first approach by continuously monitoring arXiv to detect and index datasets directly from newly published research. The system operates through a low-overhead multi-stage pipeline. First, a lightweight classifier rapidly filters titles and abstracts to identify papers releasing datasets, achieving an F1 score of 0.94 with an inference latency of 11 ms. For identified papers, we parse PDFs with GROBID and apply a sentence-level…
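The paper-first flow described above (filter on title and abstract, then pull dataset links from the parsed text) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the keyword cues stand in for the trained classifier, the sentences are assumed to be already extracted from the PDF (the GROBID step is not reproduced here), and every name in the block (`Paper`, `DATASET_CUES`, `looks_like_dataset_paper`, `extract_candidate_urls`, `run_pipeline`) is hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical cue phrases: a crude stand-in for the paper's trained classifier.
DATASET_CUES = ("we release", "new dataset", "benchmark", "corpus")

# Matches http(s) URLs up to whitespace or common closing punctuation.
URL_PATTERN = re.compile(r"https?://[^\s)\]>,;]+")


@dataclass
class Paper:
    title: str
    abstract: str
    body_sentences: list[str]  # assumed to come from an upstream PDF parser


def looks_like_dataset_paper(paper: Paper) -> bool:
    """Stage 1 (assumed heuristic): cheap filter on title + abstract only."""
    text = f"{paper.title} {paper.abstract}".lower()
    return any(cue in text for cue in DATASET_CUES)


def extract_candidate_urls(sentences: list[str]) -> list[str]:
    """Stage 2 (assumed heuristic): collect unique URLs, sentence by sentence."""
    urls: list[str] = []
    for sentence in sentences:
        for match in URL_PATTERN.findall(sentence):
            url = match.rstrip(".")  # drop a trailing sentence period
            if url not in urls:
                urls.append(url)
    return urls


def run_pipeline(papers: list[Paper]) -> dict[str, list[str]]:
    """Map each paper that passes the filter to its candidate dataset URLs."""
    index: dict[str, list[str]] = {}
    for paper in papers:
        if looks_like_dataset_paper(paper):
            index[paper.title] = extract_candidate_urls(paper.body_sentences)
    return index


if __name__ == "__main__":
    demo = [
        Paper(
            title="FooBench: a new dataset for X",
            abstract="We release FooBench.",
            body_sentences=["Data is available at https://example.org/foobench."],
        ),
        Paper(
            title="Faster attention",
            abstract="We propose a kernel.",
            body_sentences=["Code at https://example.org/attn"],
        ),
    ]
    print(run_pipeline(demo))
```

Running the cheap filter first means the more expensive parsing and link extraction only happen for papers that are likely to release a dataset, which is the motivation the abstract gives for a multi-stage design.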

Taxonomy

Topics: Biomedical Text Mining and Ontologies · Scientific Computing and Data Management · Research Data Management Practices