A Dataset for Distilling Knowledge Priors from Literature for Therapeutic Design

Haydn Thomas Jones; Natalie Maus; Josh Magnus Ludan; Maggie Ziyu Huan; Jiaming Liang; Marcelo Der Torossian Torres; Jiatao Liang; Zachary Ives; Yoseph Barash; Cesar de la Fuente-Nunez; Jacob R. Gardner; Mark Yatskar

arXiv:2508.10899 · cs.LG · September 15, 2025

TL;DR

This paper introduces Medex, a large dataset of literature-derived priors for therapeutic design, enabling AI models to generate safer, more effective molecules by incorporating experimental constraints.

Contribution

The paper presents Medex, a dataset of 32.3 million literature-derived facts paired with entity representations, and demonstrates its effectiveness in improving AI models for therapeutic molecule design.

Findings

01. Models pretrained on Medex outperform larger models on TDC tasks.

02. Medex-based models generate safer molecules in GuacaMol that remain nearly as effective.

03. Large literature-derived priors enhance AI reasoning in drug discovery.

Abstract

AI-driven discovery can greatly reduce design time and enhance new therapeutics' effectiveness. Models using simulators explore broad design spaces but risk violating implicit constraints due to a lack of experimental priors. For example, in a new analysis we performed on a diverse set of models on the GuacaMol benchmark using supervised classifiers, over 60% of the proposed molecules had a high probability of being mutagenic. In this work, we introduce Medex, a dataset of priors for design problems extracted from literature describing compounds used in lab settings. It is constructed with LLM pipelines that discover therapeutic entities in relevant paragraphs and summarize the information as concise fair-use facts. Medex consists of 32.3 million pairs of natural language facts and appropriate entity representations (i.e., SMILES or RefSeq IDs). To demonstrate the potential of the data, we…
