Omics Data Discovery Agents
Alexandre Hutton, Jesse G. Meyer

TL;DR
This paper introduces an agentic framework utilizing large language models to automate the retrieval, extraction, and analysis of omics data from biomedical literature, enabling scalable and reproducible data reuse.
Contribution
It presents a novel LLM-based system that fetches articles, extracts metadata, identifies datasets, and performs analyses, transforming static literature into executable research objects.
Findings
Achieved 80% precision in dataset identification from PubMed articles.
Re-quantified proteomics data with 63% overlap in differentially expressed proteins.
Demonstrated cross-study comparisons revealing consistent protein regulation patterns.
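The first two findings are ratio metrics, and a small sketch can make explicit what such numbers measure. The summary does not state how "precision" or "overlap" were defined, so the definitions below are assumptions: precision as correct identifications over all identifications, and overlap as either the fraction of originally reported proteins that were recovered or the Jaccard index. All function names and the toy protein IDs are hypothetical and are not taken from the paper.

```python
from typing import Set


def precision(true_positives: int, false_positives: int) -> float:
    """Share of identified datasets that were correct (assumed definition)."""
    total = true_positives + false_positives
    if total == 0:
        raise ValueError("no identifications to score")
    return true_positives / total


def recovered_fraction(reported: Set[str], requantified: Set[str]) -> float:
    """Share of originally reported proteins that also appear after
    re-quantification (one possible reading of 'overlap')."""
    if not reported:
        raise ValueError("reported set is empty")
    return len(reported & requantified) / len(reported)


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Intersection over union (an alternative reading of 'overlap')."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


if __name__ == "__main__":
    # Toy values for illustration only; these are not the paper's data.
    reported = {"P1", "P2", "P3", "P4"}
    requantified = {"P2", "P3", "P4", "P5"}
    print(precision(true_positives=8, false_positives=2))
    print(recovered_fraction(reported, requantified))
    print(jaccard(reported, requantified))
```

The two overlap definitions give different values on the same sets, so a reported overlap percentage is only interpretable once the denominator is known.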
Abstract
The biomedical literature contains a vast collection of omics studies, yet most published data remain functionally inaccessible for computational reuse. When raw data are deposited in public repositories, essential information for reproducing reported results is dispersed across main text, supplementary files, and code repositories. In the rarer instances where intermediate data are made available (e.g., protein abundance files), their location is inconsistent. In this article, we present an agentic framework that fetches omics-related articles and transforms the unstructured information into searchable research objects. Our system employs large language model (LLM) agents with access to tools for fetching omics studies, extracting article metadata, identifying and downloading published data, executing containerized quantification pipelines, and running analyses to address novel questions. We…
Taxonomy
Topics: Biomedical Text Mining and Ontologies · Scientific Computing and Data Management · Bioinformatics and Genomic Networks
