AutoDataset: A Lightweight System for Continuous Dataset Discovery and Search
Junzhe Yang, Xinghao Chen, Yunuo Liu, Zhijing Sun, Wenjin Guo, Xiaoyu Shen

TL;DR
AutoDataset is an automated, real-time system that continuously monitors arXiv to discover, extract, and index new datasets from research papers, significantly improving dataset discovery speed and coverage.
Contribution
We introduce AutoDataset, a lightweight, automated pipeline for real-time dataset discovery from research papers, combining classification, PDF parsing, URL extraction, and semantic search.
Findings
Achieves an F1 score of 0.94 on dataset-paper classification
Reduces dataset discovery time by up to 80%
Enables low-latency natural language search for datasets
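As a rough illustration of what natural language search over indexed datasets might look like, the sketch below ranks dataset descriptions against a query by cosine similarity. It is a minimal stand-in under stated assumptions: it uses bag-of-words counts rather than whatever retrieval model AutoDataset actually uses, and the names `tokenize`, `cosine`, `search`, and the sample index entries are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two token-count vectors (Counters)."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(query, index, top_k=3):
    """Return up to top_k dataset names whose descriptions overlap the query.

    `index` maps a dataset name to a short description (hypothetical layout).
    Bag-of-words scoring is an assumption made for this sketch only.
    """
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(desc))), name)
              for name, desc in index.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [name for score, name in scored[:top_k] if score > 0]


# Hypothetical index entries, for illustration only.
index = {
    "qa-set": "question answering dataset for medical text",
    "img-set": "image classification benchmark of street signs",
}
results = search("medical question answering", index)
```

A real deployment would likely replace the count vectors with dense embeddings and an approximate nearest-neighbor index to keep latency low; the ranking interface would stay much the same.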
Abstract
The continuous expansion of task-specific datasets has become a major driver of progress in machine learning. However, discovering newly released datasets remains difficult, as existing platforms largely depend on manual curation or community submissions, leading to limited coverage and substantial delays. To address this challenge, we introduce AutoDataset, a lightweight, automated system for real-time dataset discovery and retrieval. AutoDataset adopts a paper-first approach by continuously monitoring arXiv to detect and index datasets directly from newly published research. The system operates through a low-overhead multi-stage pipeline. First, a lightweight classifier rapidly filters titles and abstracts to identify papers releasing datasets, achieving an F1 score of 0.94 with an inference latency of 11 ms. For identified papers, we parse PDFs with GROBID and apply a sentence-level…
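The first stages described above (filtering on title and abstract, then pulling dataset links from parsed text at the sentence level) can be sketched roughly as follows. This is a simplified illustration, not the authors' implementation: the keyword cues stand in for the trained classifier, the body text is assumed to have already been extracted by a PDF parser such as GROBID, and every name here (`Paper`, `DATASET_CUES`, `looks_like_dataset_paper`, `extract_dataset_urls`) is hypothetical.

```python
import re
from dataclasses import dataclass

# Hypothetical cue phrases. The paper uses a trained classifier, not keywords.
DATASET_CUES = ("we release", "new dataset", "benchmark", "corpus")

# Simple URL matcher; stops at whitespace and common closing punctuation.
URL_PATTERN = re.compile(r"https?://[^\s)\]>,;\"']+")


@dataclass
class Paper:
    title: str
    abstract: str
    body_text: str  # plain text assumed to come from an upstream PDF parser


def looks_like_dataset_paper(paper):
    """Stage 1 stand-in: flag papers whose title or abstract has a cue phrase."""
    text = f"{paper.title} {paper.abstract}".lower()
    return any(cue in text for cue in DATASET_CUES)


def split_sentences(text):
    """Naive splitter: break on whitespace that follows '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_dataset_urls(paper):
    """Stage 2 stand-in: keep URLs from sentences that mention a dataset."""
    urls = []
    for sentence in split_sentences(paper.body_text):
        if "dataset" not in sentence.lower():
            continue
        for url in URL_PATTERN.findall(sentence):
            url = url.rstrip(".")  # drop sentence-final period
            if url not in urls:
                urls.append(url)
    return urls


# Made-up example paper, for illustration only.
paper = Paper(
    title="X: A New Dataset for QA",
    abstract="We release a corpus of annotated questions.",
    body_text=(
        "Our dataset is available at https://example.org/qa-data. "
        "Code is at https://example.org/code."
    ),
)
is_dataset_paper = looks_like_dataset_paper(paper)
dataset_urls = extract_dataset_urls(paper) if is_dataset_paper else []
```

Filtering on title and abstract before touching the PDF keeps the expensive parsing step off the large majority of papers that release no dataset, which is the low-overhead property the abstract emphasizes.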
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Scientific Computing and Data Management · Research Data Management Practices
