Beyond Manual Curation: Augmenting Targeted Protein Degradation Databases via Agentic Literature Extraction Workflows

Yaochen Rao, Farzaneh Jalalypour, N. M. Anoop Krishnan, and Rocío Mercado

arXiv:2605.11221·q-bio.QM·May 13, 2026

TL;DR

This paper introduces an expert-in-the-loop LLM workflow that extracts targeted protein degradation (TPD) assay data from scientific publications, expanding existing molecular glue and PROTAC databases and recovering assay context that they often lack.

Contribution

The paper presents a domain-specific, scalable LLM extraction method that needs only minimal expert annotation and augments TPD databases with high precision and recall.

Findings

01. Achieved a record-level F1 of 0.98 with only seven annotated publications.

02. Expanded molecular glue and PROTAC databases by over 80%.

03. Recovered critical kinetic and assay-context information for modeling.
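
To make the record-level F1 figure concrete, here is a minimal sketch of how such a score could be computed. It assumes that an extracted record counts as correct only when it exactly matches a reference record; this summary does not state the paper's actual matching rule, and the function name, record layout, and example values below are hypothetical.

```python
def record_level_prf(predicted, reference):
    """Return (precision, recall, F1) over two collections of hashable records.

    Assumption: a predicted record is a true positive only if an identical
    record exists in the reference set (exact match on every field).
    """
    pred, ref = set(predicted), set(reference)
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical records: (compound_id, target, endpoint_name, value)
reference = [
    ("cmpd-1", "TARGET_A", "DC50_nM", 12.0),
    ("cmpd-2", "TARGET_A", "DC50_nM", 85.0),
    ("cmpd-3", "TARGET_B", "Dmax_pct", 92.0),
]
predicted = [
    ("cmpd-1", "TARGET_A", "DC50_nM", 12.0),
    ("cmpd-2", "TARGET_A", "DC50_nM", 85.0),
    ("cmpd-3", "TARGET_B", "Dmax_pct", 29.0),  # wrong value, so no match
]

precision, recall, f1 = record_level_prf(predicted, reference)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Exact matching is the strictest choice: a single wrong field makes the record count as both a false positive and a false negative. A field-level or tolerance-based comparison would give a more forgiving score.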

Abstract

Predictive models in biomedicine depend on structured assay data locked in the text, tables, and supplements of primary publications. This bottleneck is especially acute in targeted protein degradation (TPD), where each assay record must combine compound identity, degradation target, recruiter, assay context, and endpoint values reported across sections, tables, and supplementary files. Inconsistent compound identifiers and incomplete or implicit assay context further demand domain-specific logic that generic LLM pipelines do not provide. Existing molecular glue and PROTAC databases are manually curated and often lack the experimental context required for downstream modeling. We formulate TPD database extraction as a domain-specific curation task and present an expert-in-the-loop LLM workflow, evaluated through a triangular comparison among LLM predictions, standardized baseline…
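
The abstract lists the pieces each assay record must combine: compound identity, degradation target, recruiter, assay context, and endpoint values. The sketch below shows one way such a record could be represented and checked for gaps. The class name, field names, and completeness check are assumptions for illustration, not the schema used in the paper.

```python
from dataclasses import dataclass, field, fields
from typing import Optional


@dataclass
class TPDAssayRecord:
    """Hypothetical container for one TPD assay record."""
    compound_id: Optional[str] = None      # compound identity as reported
    target: Optional[str] = None           # protein being degraded
    recruiter: Optional[str] = None        # recruited component, e.g. an E3 ligase
    assay_context: dict = field(default_factory=dict)  # e.g. cell line, time point
    endpoint_name: Optional[str] = None    # e.g. "DC50_nM"
    endpoint_value: Optional[float] = None

    def missing_fields(self):
        """List the fields still empty, in declaration order."""
        empty = (None, {}, "")
        return [f.name for f in fields(self) if getattr(self, f.name) in empty]


# A record assembled from a results table alone may lack context that is
# reported elsewhere in the paper or in its supplement.
record = TPDAssayRecord(
    compound_id="cmpd-1",
    target="TARGET_A",
    endpoint_name="DC50_nM",
    endpoint_value=12.0,
)
print(record.missing_fields())
```

A check like this can flag records that need a second pass over other sections or supplementary files before they enter a database.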
