Small Molecule Optimization with Large Language Models

Philipp Guevorguian; Menua Bedrosian; Tigran Fahradyan; Gayane Chilingaryan; Hrant Khachatrian; Armen Aghajanyan

arXiv:2407.18897 · cs.LG · July 29, 2024 · 1 citation

TL;DR

This paper introduces Chemlactica and Chemma, large language models trained on extensive molecular data, which enable efficient molecule generation and optimization for desired properties, advancing drug design capabilities.

Contribution

The paper presents two novel language models and a new optimization algorithm that together improve molecular property optimization with limited oracle access.
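"Limited oracle access" can be made concrete with a small wrapper that counts evaluations and refuses to go past a fixed budget. The sketch below is illustrative only and is not taken from the paper's code: `BudgetedOracle`, `BudgetExceeded`, and `toy_score` are hypothetical names, and the toy scoring function merely stands in for a real property oracle.

```python
from typing import Callable, Dict


class BudgetExceeded(RuntimeError):
    """Raised when no oracle evaluations are left."""


class BudgetedOracle:
    """Wrap a black-box scoring function with a call budget and a cache.

    Hypothetical helper for illustration; repeated queries for the same
    molecule string are served from the cache and do not consume budget.
    """

    def __init__(self, score_fn: Callable[[str], float], budget: int) -> None:
        self._score_fn = score_fn
        self._budget = budget
        self._cache: Dict[str, float] = {}

    @property
    def calls_used(self) -> int:
        # Each distinct molecule costs exactly one evaluation.
        return len(self._cache)

    @property
    def calls_left(self) -> int:
        return self._budget - len(self._cache)

    def __call__(self, molecule: str) -> float:
        if molecule in self._cache:
            return self._cache[molecule]
        if self.calls_left <= 0:
            raise BudgetExceeded(f"oracle budget of {self._budget} calls is spent")
        score = self._score_fn(molecule)
        self._cache[molecule] = score
        return score


def toy_score(molecule: str) -> float:
    # Assumption: a stand-in for a real oracle; counts 'C' characters.
    return float(molecule.count("C"))


if __name__ == "__main__":
    oracle = BudgetedOracle(toy_score, budget=2)
    print(oracle("CCO"), oracle("CCO"), oracle.calls_used)
```

Caching is a design choice here: an optimizer that re-proposes a known molecule should not pay for it twice when evaluations are expensive.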

Findings

01. Achieved an 8% improvement on the Practical Molecular Optimization benchmark compared to previous methods.

02. Demonstrated strong performance in generating molecules with specified properties.

03. Publicly released the models, corpus, and optimization algorithm.

Abstract

Recent advancements in large language models have opened new possibilities for generative molecular drug design. We present Chemlactica and Chemma, two language models fine-tuned on a novel corpus of 110M molecules with computed properties, totaling 40B tokens. These models demonstrate strong performance in generating molecules with specified properties and predicting new molecular characteristics from limited samples. We introduce a novel optimization algorithm that leverages our language models to optimize molecules for arbitrary properties given limited access to a black box oracle. Our approach combines ideas from genetic algorithms, rejection sampling, and prompt optimization. It achieves state-of-the-art performance on multiple molecular optimization benchmarks, including an 8% improvement on Practical Molecular Optimization compared to previous methods. We publicly release the…
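The general shape of an optimization loop that mixes a genetic-algorithm-style pool with rejection sampling under an oracle budget can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' algorithm: `optimize`, `toy_propose`, `toy_oracle`, and `toy_filter` are hypothetical names, and the `propose` callable is only a placeholder for prompting a language model with the current pool. Here it is a trivial string mutation, and the strings are not guaranteed to be valid molecules.

```python
import random
from typing import Callable, Dict, List, Tuple


def optimize(
    propose: Callable[[List[str], random.Random], str],
    oracle: Callable[[str], float],
    is_acceptable: Callable[[str], bool],
    seed_pool: List[str],
    oracle_budget: int,
    pool_size: int = 4,
    seed: int = 0,
) -> List[Tuple[float, str]]:
    """Return the top `pool_size` (score, molecule) pairs found within budget.

    `seed_pool` must be non-empty so that `propose` always has parents.
    """
    rng = random.Random(seed)
    scored: Dict[str, float] = {}

    # Score the initial pool; every distinct molecule costs one oracle call.
    for mol in seed_pool:
        if len(scored) < oracle_budget and mol not in scored:
            scored[mol] = oracle(mol)

    attempts, max_attempts = 0, 50 * oracle_budget  # guard against stalling
    while len(scored) < oracle_budget and attempts < max_attempts:
        attempts += 1
        # Genetic-algorithm flavour: only the current best molecules are parents.
        pool = sorted(scored, key=scored.get, reverse=True)[:pool_size]
        candidate = propose(pool, rng)
        # Rejection sampling: drop duplicates and filter failures
        # before spending an oracle call on them.
        if candidate in scored or not is_acceptable(candidate):
            continue
        scored[candidate] = oracle(candidate)

    ranked = sorted(((s, m) for m, s in scored.items()), reverse=True)
    return ranked[:pool_size]


def toy_propose(pool: List[str], rng: random.Random) -> str:
    # Assumption: stands in for a language-model generation step.
    return rng.choice(pool) + rng.choice("CNO")


def toy_oracle(mol: str) -> float:
    # Assumption: toy objective that rewards 'C' and penalizes length.
    return mol.count("C") - 0.1 * len(mol)


def toy_filter(mol: str) -> bool:
    return len(mol) <= 8


if __name__ == "__main__":
    for score, mol in optimize(toy_propose, toy_oracle, toy_filter, ["C"], 20):
        print(f"{score:.2f}  {mol}")
```

Filtering before scoring is what keeps the oracle count low: rejected candidates cost only a generation step, not an evaluation.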

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 5 · Confidence 3

Strengths

The strengths of the paper lie in several key areas.
1. Comprehensive dataset: the authors created a custom molecular corpus with over 100 million molecules from PubChem, incorporating detailed chemical properties.
2. The combination of LLMs with a genetic algorithm, prompt optimization, and rejection sampling allows the paper’s method to effectively explore chemical space and optimize for multiple properties at once.
3. Versatility: the models demonstrate adaptability, achieving high performa…

Weaknesses

The weaknesses of this paper include:
1. The model only considers the SMILES representation; it lacks explicit consideration of 3D conformation.
2. The proposed optimization algorithm, while efficient, still relies on a high number of oracle evaluations. The paper could further reduce oracle calls, especially for applications where computationally intensive evaluations may be costly.
3. Limited experimental validation: while the paper demonstrates strong results on computational benchmarks,…

Reviewer 02 · Rating 6 · Confidence 4

Strengths

1. The study successfully demonstrates the feasibility of molecule optimization using LLMs and a special token system.
2. The innovative approach of emulating genetic algorithms through the token system and Chain of Thought reasoning is particularly noteworthy.

Weaknesses

### Lack of Computational Efficiency Comparison
1. The paper does not provide a comparison of overall processing times between methods.

### Exploration of Efficient Training Methods
1. Have the authors considered more efficient learning methods beyond fine-tuning the entire model?
2. It would be interesting to explore the effects of techniques such as freezing specific layers, layer skipping, or parameter-efficient fine-tuning.

### Limited Exploration of LLM Capabilities for Multi-Property Opt…

Reviewer 03 · Rating 6 · Confidence 4

Strengths

* The authors release the training corpus and model checkpoints.
* Table 1 shows the benefit of transfer learning.
* Property prediction experiments show strong performance.
* Molecular optimisation experiments are thorough and compared to strong baselines.
* The Appendix is detailed, and the transparency around hyperparameter tuning and the information on floating point precision are interesting.

Weaknesses

Generally, descriptions of the model and pre-training are thorough but there are important metrics and discrepancies in the pre-training dataset that should at least be discussed. I will combine the specific points and related questions in the Questions section.

Code & Models

Repositories

yerevann/chemlactica · PyTorch · Official

Taxonomy

Topics: Machine Learning in Materials Science · Computational Drug Discovery Methods