Mining Patents with Large Language Models Elucidates the Chemical Function Landscape
Clayton W. Kosonocky, Claus O. Wilke, Edward M. Marcotte, and Andrew D. Ellington

TL;DR
This paper introduces a large dataset of chemical functions derived from patents using large language models, demonstrating its potential to identify drugs and explore chemical functionalities beyond traditional structure-based methods.
Contribution
The study creates and validates a large, text-derived chemical function dataset (CheF) that captures the functional landscape of molecules, enabling functional predictions from structure.
Findings
The CheF dataset contains 631K molecule-function pairs.
The dataset reflects a coherent semantic landscape that aligns with chemical structure.
The functional landscape enables prediction of target functionality from molecular structure.
Abstract
The fundamental goal of small molecule discovery is to generate chemicals with target functionality. While this often proceeds through structure-based methods, we set out to investigate the practicality of orthogonal methods that leverage the extensive corpus of chemical literature. We hypothesize that a sufficiently large text-derived chemical function dataset would mirror the actual landscape of chemical functionality. Such a landscape would implicitly capture complex physical and biological interactions given that chemical function arises from both a molecule's structure and its interacting partners. To evaluate this hypothesis, we built a Chemical Function (CheF) dataset of patent-derived functional labels. This dataset, comprising 631K molecule-function pairs, was created using an LLM- and embedding-based method to obtain functional labels for approximately 100K molecules from…
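As a rough illustration of what "molecule-function pairs" can look like as data, the sketch below groups pairs by molecule and builds a multi-label binary matrix, a common input format for training a structure-to-function classifier. The SMILES strings, the labels, and the helper name `build_label_matrix` are illustrative assumptions; none of them are drawn from CheF or from the authors' code.

```python
from collections import defaultdict

# Toy molecule-function pairs (illustrative only; NOT taken from CheF).
# Each pair is (SMILES string, functional label).
pairs = [
    ("CC(=O)OC1=CC=CC=C1C(=O)O", "analgesic"),               # aspirin
    ("CC(=O)OC1=CC=CC=C1C(=O)O", "anti-inflammatory"),       # aspirin
    ("CC(C)CC1=CC=C(C=C1)C(C)C(=O)O", "anti-inflammatory"),  # ibuprofen
    ("CN1C=NC2=C1C(=O)N(C(=O)N2C)C", "stimulant"),           # caffeine
]


def build_label_matrix(pairs):
    """Hypothetical helper: turn (molecule, label) pairs into a binary matrix.

    Returns the sorted molecule list, the sorted label vocabulary, and a
    matrix where matrix[i][j] == 1 if molecule i carries label j.
    """
    labels_by_molecule = defaultdict(set)
    for smiles, label in pairs:
        labels_by_molecule[smiles].add(label)

    molecules = sorted(labels_by_molecule)
    vocabulary = sorted({label for _, label in pairs})
    matrix = [
        [1 if label in labels_by_molecule[smiles] else 0 for label in vocabulary]
        for smiles in molecules
    ]
    return molecules, vocabulary, matrix


molecules, vocabulary, matrix = build_label_matrix(pairs)
for smiles, row in zip(molecules, matrix):
    print(smiles, row)
```

Each row of the matrix is one molecule and each column is one functional label, so a model that maps a structural representation of the molecule to a row of this matrix is performing multi-label function prediction.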
Taxonomy
Topics: Computational Drug Discovery Methods · Machine Learning in Materials Science · Microbial Natural Products and Biosynthesis
