Beyond Manual Curation: Augmenting Targeted Protein Degradation Databases via Agentic Literature Extraction Workflows
Yaochen Rao, Farzaneh Jalalypour, N. M. Anoop Krishnan, and Rocío Mercado

TL;DR
This paper introduces an expert-in-the-loop LLM workflow for extracting targeted protein degradation (TPD) data from scientific publications. The workflow expands existing molecular glue and PROTAC databases by over 80% and reaches a record-level F1 of 0.98.
Contribution
The paper presents a domain-specific, scalable LLM extraction method that needs only a handful of expert-annotated publications and adds records to TPD databases with high precision and recall.
Findings
Achieved record-level F1 of 0.98 with only seven annotated publications.
Expanded molecular glue and PROTAC databases by over 80%.
Recovered critical kinetic and assay-context information for modeling.
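To make the record-level F1 figure concrete, here is a minimal sketch of how such a score could be computed. It assumes that a predicted record counts as correct only when every field matches a gold record exactly. The summary does not state the paper's actual matching criteria, so that rule, the `AssayRecord` field names, and the toy values are all illustrative assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssayRecord:
    # Hypothetical field names, chosen for illustration only.
    compound_id: str
    target: str
    recruiter: str
    endpoint: str
    value: float


def record_level_f1(predicted, gold):
    """Return (precision, recall, f1) under exact-match record comparison."""
    pred_set, gold_set = set(predicted), set(gold)
    true_positives = len(pred_set & gold_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (made up, not taken from the paper).
gold = [
    AssayRecord("cmpd-1", "BRD4", "CRBN", "DC50", 10.0),
    AssayRecord("cmpd-2", "BRD4", "VHL", "DC50", 25.0),
    AssayRecord("cmpd-3", "BTK", "CRBN", "Dmax", 90.0),
]
predicted = [
    AssayRecord("cmpd-1", "BRD4", "CRBN", "DC50", 10.0),
    AssayRecord("cmpd-2", "BRD4", "VHL", "DC50", 25.0),
    AssayRecord("cmpd-3", "BTK", "VHL", "Dmax", 90.0),  # wrong recruiter
]

precision, recall, f1 = record_level_f1(predicted, gold)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is the strictest choice: one wrong field, such as the recruiter in the third toy prediction, costs both a false positive and a false negative. A real evaluation might instead tolerate numeric rounding or normalize compound identifiers before comparing.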
Abstract
Predictive models in biomedicine depend on structured assay data locked in the text, tables, and supplements of primary publications. This bottleneck is especially acute in targeted protein degradation (TPD), where each assay record must combine compound identity, degradation target, recruiter, assay context, and endpoint values reported across sections, tables, and supplementary files. Inconsistent compound identifiers and incomplete or implicit assay context further demand domain-specific logic that generic LLM pipelines do not provide. Existing molecular glue and PROTAC databases are manually curated and often lack the experimental context required for downstream modeling. We formulate TPD database extraction as a domain-specific curation task and present an expert-in-the-loop LLM workflow, evaluated through a triangular comparison among LLM predictions, standardized baseline…
