SAFE setup for generative molecular design
Yassir El Mesbahi, Emmanuel Noutahi

TL;DR
This paper investigates optimal training setups for SAFE, a fragment-based molecular generative model, demonstrating its advantages over SMILES-based methods in drug design tasks.
Contribution
It systematically analyzes how dataset size, data augmentation through randomization, model architecture, and bond disconnection algorithms affect SAFE model performance in molecular generation.
Findings
Larger, diverse datasets improve model performance.
LLaMA architecture with Rotary Positional Embedding is most robust.
SAFE models consistently outperform SMILES-based approaches in scaffold decoration and linker design, with BRICS decomposition yielding the best results.
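To make the Rotary Positional Embedding finding more concrete, here is a minimal pure-Python sketch of the rotation idea. It is not the paper's code or LLaMA's implementation. The function names `rope_rotate` and `dot`, the base of 10000, and the interleaved pairing of dimensions are assumptions chosen for illustration; production implementations typically work on batched tensors and may pair dimensions differently.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Rotate consecutive (even, odd) pairs of `vec` by position-dependent angles.

    Illustrative sketch only: `base` and the interleaved pairing are assumptions.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("embedding dimension must be even")
    out = []
    for i in range(0, dim, 2):
        # Each pair gets its own frequency; later pairs rotate more slowly.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product of two equal-length sequences."""
    return sum(x * y for x, y in zip(a, b))


# Hypothetical query and key vectors, for illustration only.
q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Both position pairs have the same offset (2), so the scores should agree
# up to floating-point error.
score_near = dot(rope_rotate(q, 3), rope_rotate(k, 1))
score_far = dot(rope_rotate(q, 12), rope_rotate(k, 10))
```

Because each pair is rotated by an angle proportional to its position, the dot product between a rotated query and a rotated key depends only on the offset between their positions, not on where they sit in the sequence. This relative-position property is the usual motivation for rotary embeddings; the sketch does not by itself explain the robustness the paper reports.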
Abstract
SMILES-based molecular generative models have been pivotal in drug design but face challenges in fragment-constrained tasks. To address this, the Sequential Attachment-based Fragment Embedding (SAFE) representation was recently introduced as an alternative that streamlines those tasks. In this study, we investigate the optimal setups for training SAFE generative models, focusing on dataset size, data augmentation through randomization, model architecture, and bond disconnection algorithms. We found that larger, more diverse datasets improve performance, with the LLaMA architecture using Rotary Positional Embedding proving most robust. SAFE-based models also consistently outperform SMILES-based approaches in scaffold decoration and linker design, particularly with BRICS decomposition yielding the best results. These insights highlight key factors that significantly impact the efficacy of…
Taxonomy
Topics: Chemistry and Chemical Engineering
Methods: LLaMA
