Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale
Haydn Jones, Yimeng Zeng, Alden Rose, Li S. Yifei, Yining Huang, Kaiwen Wu, Jiaming Liang, Maggie Ziyu Huan, Yoseph Barash, Cesar de la Fuente-Nunez, Osbert Bastani, Zachary Ives, Mark Yatskar, Jacob R. Gardner

TL;DR
This paper introduces a scalable, AI-driven approach to transforming PubMed into detailed, structured biomedical datasets using large language models, hybrid retrieval, and multi-agent systems, surpassing traditional curated databases in size and nuance.
Contribution
It presents a novel pipeline combining LLM-based tagging, hybrid retrieval, and a multi-agent system to generate large, nuanced biomedical datasets directly from PubMed.
Findings
Generated ~6.3 million records across six biomedical tasks.
Achieved error rates of 0.6-7.7%, lower than those of the curated datasets used for comparison.
Produced the largest public datasets for several biomedical properties.
Abstract
Manually curated biomedical repositories -- spanning bioactivity, genomics, and chemistry -- are expensive to maintain, lag behind primary literature, and discard experimental context, obscuring nuances needed to assess data correctness and coverage. We show that PubMed itself can be autonomously and cost-effectively turned into structured datasets that are larger, more nuanced, and more accurate than the curated databases they replace. We present three coupled contributions: (1) an LLM-based entity-tagging pipeline, grounded in nine biomedical ontologies, that tags 4.5B entities across 19 categories in a 22.5M-paper, 2.5T-token PubMed corpus; (2) hybrid sparse-dense retrieval supporting entity-filtered semantic queries over the tagged corpus; and (3) Starling, a multi-agent deep research system that, given only a natural-language task description, designs precision- and recall-targeted…
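
The entity-filtered hybrid retrieval mentioned in contribution (2) could be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the abstract does not say how sparse and dense results are combined, so reciprocal rank fusion (RRF) is used here as an assumed stand-in. The function names, paper IDs, and entity labels are all hypothetical.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Reciprocal rank fusion: score(d) = sum of 1 / (k + rank) over rankings.

    ASSUMPTION: RRF is a stand-in; the source does not name a fusion method.
    k=60 is the constant conventionally used with RRF.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def entity_filtered_search(sparse_ranking, dense_ranking, doc_entities, required):
    """Drop documents lacking any required entity tag, then fuse both rankings."""
    required = set(required)

    def keep(ranking):
        return [d for d in ranking if required <= doc_entities.get(d, set())]

    return rrf_fuse([keep(sparse_ranking), keep(dense_ranking)])


# Hypothetical toy data: entity tags per paper and two precomputed rankings.
doc_entities = {
    "p1": {"gene:TP53", "disease:cancer"},
    "p2": {"gene:TP53"},
    "p3": {"chemical:aspirin"},
    "p4": {"gene:TP53", "chemical:aspirin"},
}
sparse_ranking = ["p3", "p1", "p2", "p4"]  # e.g. from a keyword scorer
dense_ranking = ["p2", "p4", "p1", "p3"]   # e.g. from embedding similarity

results = entity_filtered_search(
    sparse_ranking, dense_ranking, doc_entities, required={"gene:TP53"}
)
print(results)
```

In this sketch the entity filter runs before fusion, so ranks are computed only over papers that carry every required tag. A real system over billions of tags would presumably push the filter into the index rather than post-filter ranked lists.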
