M$^{3}$-20M: A Large-Scale Multi-Modal Molecule Dataset for AI-driven Drug Design and Discovery
Siyuan Guo, Lexuan Wang, Chang Jin, Jinxian Wang, Han Peng, Huayang Shi, Wengen Li, Jihong Guan, Shuigeng Zhou

TL;DR
M$^{3}$-20M is a multi-modal dataset of over 20 million molecules built for AI-driven drug discovery. It contains 71 times as many molecules as the largest existing dataset and integrates several data types per molecule to support diverse molecular tasks.
Contribution
The paper introduces M$^{3}$-20M, the largest multi-modal molecule dataset, enabling improved model training for drug design and discovery tasks.
Findings
Enhanced model performance in molecule generation and property prediction.
Significant increase in diversity and validity of generated molecules.
Higher accuracy in property prediction compared to single-modal datasets.
Abstract
This paper introduces M$^{3}$-20M, a large-scale Multi-Modal Molecule dataset that contains over 20 million molecules, with the data mainly integrated from existing databases and partially generated by large language models. Designed to support AI-driven drug design and discovery, M$^{3}$-20M contains 71 times as many molecules as the largest existing dataset, a scale that can greatly benefit the training or fine-tuning of models, including large language models, for drug design and discovery tasks. The dataset integrates one-dimensional SMILES, two-dimensional molecular graphs, three-dimensional molecular structures, physicochemical properties, and textual descriptions collected through web crawling and generated using GPT-3.5, offering a comprehensive view of each molecule. To demonstrate the power of M$^{3}$-20M in drug design and discovery, we…
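As an illustration of the five modalities listed in the abstract, a single entry could be modelled roughly as follows. This is a minimal sketch only: the class name `MoleculeRecord`, its field names, and the `modalities` helper are hypothetical and are not taken from the dataset's actual schema or release format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class MoleculeRecord:
    """Hypothetical container for one multi-modal molecule entry."""

    smiles: str                                                   # 1D: SMILES string
    atoms: List[str] = field(default_factory=list)                # 2D graph nodes (element symbols)
    bonds: List[Tuple[int, int]] = field(default_factory=list)    # 2D graph edges (atom index pairs)
    coords: Optional[List[Tuple[float, float, float]]] = None     # 3D: one (x, y, z) per atom
    properties: Dict[str, float] = field(default_factory=dict)    # physicochemical properties
    description: Optional[str] = None                             # textual description

    def modalities(self) -> List[str]:
        """Return the names of the modalities that are populated for this record."""
        present = ["smiles"]
        if self.atoms and self.bonds:
            present.append("graph")
        if self.coords is not None:
            present.append("3d")
        if self.properties:
            present.append("properties")
        if self.description:
            present.append("text")
        return present


# Example entry for ethanol; the 3D coordinates are deliberately left unset.
ethanol = MoleculeRecord(
    smiles="CCO",
    atoms=["C", "C", "O"],
    bonds=[(0, 1), (1, 2)],
    properties={"molecular_weight": 46.07},
    description="Ethanol is a simple two-carbon alcohol.",
)
print(ethanol.modalities())
```

Keeping every modality optional except the SMILES string reflects one possible design choice: a record can then be filtered by which views it actually carries before it is fed to a single-modal or multi-modal model.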
Taxonomy
Topics: Computational Drug Discovery Methods · Machine Learning in Materials Science · Genetics, Bioinformatics, and Biomedical Research
Methods: Attention Is All You Need · Attention Dropout · Position-Wise Feed-Forward Layer · Softmax · Cosine Annealing · Byte Pair Encoding · Linear Layer
