UniMAP: Universal SMILES-Graph Representation Learning
Shikun Feng, Lixin Yang, Yanwen Huang, Yuyan Ni, Weiying Ma, Yanyan Lan

TL;DR
UniMAP is a novel universal model that effectively integrates SMILES and graph modalities for molecular representation, significantly improving performance on various drug-related prediction tasks through comprehensive cross-modality fusion.
Contribution
It introduces a multi-task pre-training framework that captures fine-grained semantics between SMILES and graph representations for molecular learning.
Findings
Outperforms state-of-the-art pre-training methods on multiple tasks.
Effectively captures cross-modality semantics for better molecular understanding.
Visualizations demonstrate improved representation quality.
Abstract
Molecular representation learning is fundamental for many drug-related applications. Most existing molecular pre-training models are limited to a single molecular modality, either SMILES or graph representation. To effectively leverage both modalities, we argue that it is critical to capture the fine-grained 'semantics' between SMILES and graph, because subtle sequence/graph differences may lead to contrary molecular properties. In this paper, we propose a universal SMILES-graph representation learning model, namely UniMAP. First, an embedding layer is employed to obtain the token and node/edge representations in SMILES and graph, respectively. A multi-layer Transformer is then utilized to conduct deep cross-modality fusion. Specifically, four kinds of pre-training tasks are designed for UniMAP, including Multi-Level Cross-Modality Masking (CMM), SMILES-Graph Matching (SGM),…
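To make the two modalities in the abstract concrete, the following is a minimal sketch of how one molecule yields both a SMILES token sequence and a node/edge graph, along with a token-to-atom alignment. It is not the UniMAP implementation: the function names (`tokenize`, `smiles_to_graph`) and the simplified grammar are assumptions made for illustration, and the toy parser covers only bracket atoms, common organic-subset atoms, explicit `-`/`=`/`#` bonds, branches, and single-digit ring closures (no aromatic atoms, charges, or stereochemistry). A real pipeline would use a cheminformatics toolkit instead.

```python
import re

# Toy SMILES grammar (an assumption for illustration, not the full SMILES spec).
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[=#\-]|[()]|\d")
BOND_ORDER = {"-": 1, "=": 2, "#": 3}


def tokenize(smiles):
    """Split a SMILES string into tokens; reject anything the toy grammar misses."""
    tokens = TOKEN_RE.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported SMILES characters in {smiles!r}")
    return tokens


def smiles_to_graph(smiles):
    """Return (tokens, atoms, bonds, token_to_atom) for a simple SMILES string.

    atoms is a list of atom symbols (graph nodes), bonds is a list of
    (i, j, order) tuples (graph edges), and token_to_atom maps the index of
    each atom token to its node index.
    """
    tokens = tokenize(smiles)
    atoms, bonds, token_to_atom = [], [], {}
    prev = None      # node the next atom will bond to
    stack = []       # branch points saved at "("
    pending = 1      # bond order for the next bond
    ring_open = {}   # ring digit -> (node index, bond order)
    for t_idx, tok in enumerate(tokens):
        if tok == "(":
            stack.append(prev)
        elif tok == ")":
            prev = stack.pop()
        elif tok in BOND_ORDER:
            pending = BOND_ORDER[tok]
        elif tok.isdigit():
            if tok in ring_open:
                other, order = ring_open.pop(tok)
                bonds.append((other, prev, max(order, pending)))
            else:
                ring_open[tok] = (prev, pending)
            pending = 1
        else:  # atom token
            atoms.append(tok)
            idx = len(atoms) - 1
            token_to_atom[t_idx] = idx
            if prev is not None:
                bonds.append((prev, idx, pending))
            prev = idx
            pending = 1
    return tokens, atoms, bonds, token_to_atom


# Ethanol and acetaldehyde differ by one SMILES token and one bond order.
ethanol = smiles_to_graph("CCO")
acetaldehyde = smiles_to_graph("CC=O")
print(ethanol[0], ethanol[2])
print(acetaldehyde[0], acetaldehyde[2])
```

The last lines illustrate the abstract's point about subtle differences: ethanol (`CCO`) and acetaldehyde (`CC=O`) differ by a single token in the sequence view and a single bond order in the graph view, yet they are different compounds. The `token_to_atom` mapping is one simple way to express a fine-grained correspondence between the two views; whether UniMAP uses such an alignment is not stated in the text above.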
Taxonomy
Topics: Computational Drug Discovery Methods · Machine Learning in Materials Science · Machine Learning in Bioinformatics
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Position-Wise Feed-Forward Layer · Dense Connections · Absolute Position Encodings · Adam · Label Smoothing · Residual Connection
