Evaluating the Progression of Large Language Model Capabilities for Small-Molecule Drug Design
Shriram Chennakesavalu, Kirill Shmilovich, Hayley Weir, Colin Grambow, John Bradshaw, Patricia Suriana, Chen Cheng, Kangway Chuang

TL;DR
This paper introduces chemically-grounded tasks, formulated as RL environments, to evaluate and improve large language models for small-molecule drug design. It finds that frontier models are increasingly capable but still limited, especially in low-data experimental settings.
Contribution
It presents a new suite of chemically-grounded tasks as RL environments and demonstrates how targeted post-training enhances LLM performance in drug discovery.
Findings
Frontier models are increasingly proficient at chemical tasks but leave significant room for improvement, especially in low-data experimental settings.
RL-based post-training significantly boosts model performance.
A smaller model with targeted training rivals state-of-the-art models.
Abstract
Large Language Models (LLMs) have the potential to accelerate small molecule drug design due to their ability to reason about information from diverse sources and formats. However, their practical utility remains unclear due to the lack of benchmarks that reflect real-world scenarios. In this work, we introduce a suite of chemically-grounded tasks spanning molecular property prediction, molecular representation transformations, and molecular design. Importantly, we formulate these tasks as reinforcement learning (RL) environments, enabling a unified approach for evaluation and post-training. Across three model families, we find that frontier models are increasingly proficient at chemical tasks, but that there is significant room for improvement, especially in experimental settings with low data. Critically, we show that RL-based post-training can substantially improve performance. A…
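To make the idea of "tasks as RL environments" concrete, here is a minimal sketch of how a single-turn property-prediction task could expose one reward function that serves both evaluation and post-training. It is not the paper's implementation: the `PropertyTask` class, the `<answer>` tag format, the tolerance-based binary reward, and the example task are all assumptions chosen for illustration.

```python
import re
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class PropertyTask:
    """Hypothetical single-turn task: a prompt plus a reference value."""
    prompt: str
    target: float
    tolerance: float


def parse_answer(completion: str) -> Optional[float]:
    """Extract the number inside <answer>...</answer>, or None if absent."""
    match = re.search(r"<answer>\s*(-?\d+(?:\.\d+)?)\s*</answer>", completion)
    return float(match.group(1)) if match else None


def reward(task: PropertyTask, completion: str) -> float:
    """Return 1.0 if the parsed answer is within tolerance of the target, else 0.0."""
    value = parse_answer(completion)
    if value is None:
        return 0.0  # unparseable output earns no reward
    return 1.0 if abs(value - task.target) <= task.tolerance else 0.0


def evaluate(task: PropertyTask, completions: List[str]) -> float:
    """Mean reward over sampled completions (an accuracy-style score)."""
    return sum(reward(task, c) for c in completions) / len(completions)


# Illustrative task (assumed, not from the paper): molecular weight of ethanol.
task = PropertyTask(
    prompt="What is the molecular weight of CCO in g/mol? Reply as <answer>x</answer>.",
    target=46.07,
    tolerance=0.5,
)
samples = ["<answer>46.1</answer>", "<answer>60</answer>", "I am not sure."]
score = evaluate(task, samples)
```

Under these assumptions, the same `reward` function can score a fixed model for benchmarking or supply the scalar signal for a policy-gradient update, which is the sense in which a single environment definition unifies evaluation and post-training.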
