Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

Haydn Jones; Yimeng Zeng; Alden Rose; Li S. Yifei; Yining Huang; Kaiwen Wu; Jiaming Liang; Maggie Ziyu Huan; Yoseph Barash; Cesar de la Fuente-Nunez; Osbert Bastani; Zachary Ives; Mark Yatskar; Jacob R. Gardner

arXiv:2605.07022·cs.LG·May 19, 2026

TL;DR

This paper introduces a scalable pipeline that turns PubMed into detailed, structured biomedical datasets using LLM-based entity tagging, hybrid retrieval, and a multi-agent system. The authors report that the resulting datasets surpass traditional curated databases in size and nuance.

Contribution

The paper presents a pipeline that combines LLM-based tagging, hybrid retrieval, and a multi-agent system to generate large, nuanced biomedical datasets directly from PubMed.

Findings

01  Generated ~6.3 million records across six biomedical tasks.

02  Achieved lower error rates (0.6-7.7%) than curated datasets.

03  Produced the largest public datasets for several biomedical properties.

Abstract

Manually curated biomedical repositories -- spanning bioactivity, genomics, and chemistry -- are expensive to maintain, lag behind primary literature, and discard experimental context, obscuring nuances needed to assess data correctness and coverage. We show that PubMed itself can be autonomously and cost-effectively turned into structured datasets that are larger, more nuanced, and more accurate than the curated databases they replace. We present three coupled contributions: (1) an LLM-based entity-tagging pipeline, grounded in nine biomedical ontologies, that tags 4.5B entities across 19 categories in a 22.5M-paper, 2.5T-token PubMed corpus; (2) hybrid sparse-dense retrieval supporting entity-filtered semantic queries over the tagged corpus; and (3) Starling, a multi-agent deep research system that, given only a natural-language task description, designs precision- and recall-targeted…
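
The abstract's second contribution, entity-filtered hybrid sparse-dense retrieval, can be illustrated with a small sketch. This is not the authors' implementation: the toy corpus, the field names, the term-overlap sparse score, the three-dimensional vectors, and the use of reciprocal rank fusion to combine the two rankings are all assumptions chosen for illustration.

```python
import math
from collections import Counter

# Toy tagged corpus. Every field and value is a hypothetical stand-in,
# not the paper's schema or data.
DOCS = [
    {"id": "p1", "text": "peptide shows antimicrobial activity against E. coli",
     "entities": {"peptide", "organism"}, "vec": [0.9, 0.1, 0.0]},
    {"id": "p2", "text": "kinase inhibitor binding affinity measured in vitro",
     "entities": {"chemical", "protein"}, "vec": [0.1, 0.9, 0.1]},
    {"id": "p3", "text": "novel peptide with low hemolytic activity",
     "entities": {"peptide"}, "vec": [0.8, 0.2, 0.1]},
]

def sparse_score(query, text):
    """Term-overlap count, a crude stand-in for a sparse scorer such as BM25."""
    q = Counter(query.lower().split())
    t = Counter(text.lower().split())
    return sum(min(q[w], t[w]) for w in q)

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def ranks(scores):
    """Map each doc id to its 1-based rank (higher score = better rank)."""
    ordered = sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
    return {doc_id: i + 1 for i, doc_id in enumerate(ordered)}

def hybrid_search(query, query_vec, required_entities, k=60):
    """Filter by entity tags, then fuse sparse and dense rankings.

    Fusion uses reciprocal rank fusion (an assumed choice):
    score = 1/(k + sparse_rank) + 1/(k + dense_rank).
    """
    # Entity filter: keep only documents tagged with every required category.
    candidates = [d for d in DOCS if required_entities <= d["entities"]]
    sparse = {d["id"]: sparse_score(query, d["text"]) for d in candidates}
    dense = {d["id"]: cosine(query_vec, d["vec"]) for d in candidates}
    s_rank, d_rank = ranks(sparse), ranks(dense)
    fused = {i: 1 / (k + s_rank[i]) + 1 / (k + d_rank[i]) for i in sparse}
    return sorted(fused, key=lambda i: (-fused[i], i))

if __name__ == "__main__":
    print(hybrid_search("antimicrobial peptide activity", [1.0, 0.0, 0.0], {"peptide"}))
```

Filtering on entity tags before scoring keeps the ranking step restricted to documents that mention the required categories, which is one plausible reading of "entity-filtered semantic queries"; a production system would use an inverted index and an approximate nearest-neighbor index rather than a linear scan.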

Code & Models

Repositories

starling-labs/starling (GitHub)
