Multimodal pre-training models of molecular representation for drug discovery
Xiaoqi Wang, Chuanshi Wang, Boya Ji, Junwen Wang, Mingyue Zheng, Lingyun Song, Shaoliang Peng, Xuequn Shang

TL;DR
This paper reviews how multimodal pre-training models are being used to improve drug discovery by combining different data types and techniques.
Contribution
The paper systematically reviews multimodal pre-training models and identifies emerging trends in molecular representation for drug discovery.
Findings
Multimodal pre-training models integrate Transformers and graph neural networks for cross-scale molecular representation.
Molecular captions help bridge drug discovery with large language models.
Adaptability between modalities and pre-training tasks is crucial for model effectiveness.
Abstract
With the great success of large language models in natural language processing, self-supervised pre-training models have emerged as an important technique in drug discovery. In particular, multimodal pre-training models have opened a new avenue for drug discovery. The experience and ideas from previous works can provide important reference points for further research in drug discovery. Therefore, this review summarizes the foundation of multimodal pre-training models and their progress in the field of drug discovery. We emphasize the adaptability between various modalities and network frameworks or pre-training tasks. At the same time, we summarize the difference and relevance between various modalities or pre-training models. Importantly, we identify two increasing trends that may serve as reference points for future research. Specifically, Transformers and graph neural networks are…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6| Model | Input data | Key modules | Characteristic |
|---|---|---|---|
| DNN | Structured vectors | Fully connected layer with nonlinear function |
|
|
| |||
| CNN | Grid-like data | Convolutional and pooling operation |
|
|
| |||
| RNN | Sequence data | Recurrent units |
|
|
| |||
| Transformer | Long sequence data | Self-attention mechanisms, and positional encodings |
|
|
| |||
| GNN | Graph data | Spectral graph convolutions or message passing mechanisms |
|
|
|
| Pre-training task | Optimization objective | Advantages and disadvantages |
|---|---|---|
| Multimodal contrastive learning | Modality alignment |
|
|
| ||
|
| ||
| Multimodal matching prediction | Consistency prediction |
|
|
| ||
|
| ||
| Multimodal masked prediction | Context reconstruction |
|
|
| ||
| Multimodal autoregressive prediction | Condition generation |
|
|
| ||
|
|
| Pre-training strategy | Key role of pre-training model | Advantages and disadvantages |
|---|---|---|
| Joint training | Regularization or auxiliary term |
|
|
| ||
|
| ||
| Unsupervised representation learning | Feature extraction |
|
|
| ||
|
| ||
| Pre-training and fine-tuning | Parameter initialization of encoders |
|
|
| ||
|
|
- —National Natural Science Foundation of China10.13039/501100001809
- —FDCT10.13039/501100006469
- —National Key Research and Development Program of China10.13039/501100012166
- —Fundamental Research Funds for the Central Universities10.13039/501100012226
- —Innovative Research Group Project10.13039/100014718
- —Hunan Science and Technology Innovation Plan
- —Key Research and Development Program of Hunan Province of China10.13039/501100019091
- —Key Technologies Research and Development Program10.13039/501100012165
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsComputational Drug Discovery Methods · Machine Learning in Bioinformatics · Machine Learning in Materials Science
INTRODUCTION
Drug development is of great significance to human health. For given targets, the pipeline of new drug development includes lead discovery and optimization, pre-clinical studies, clinical trials and regulatory approval. This is an expensive and time-consuming process with a very low success rate [1,2]. It takes on average US$2.6 billion and 10–15 years to develop a new drug. Approximately, five of the screened 10 000 compounds in the initial stage are selected for clinical trials [3], as shown in Fig. 1. Unfortunately, the number of new drugs gaining regulatory approval per billion USD spent halves every nine years, a trend sometimes called ‘Moore’s law in reverse’ [4]. To improve efficiency of new drug development, deep-learning technologies driven by a great deal of biomedical data have been applied to the pharmaceutical industry [2,5,6]. In particular, AlphaFold [7,8], a deep-learning-based protein structure prediction method, has emerged as a key technology for drug research and development [9,10]. With the rapid development of deep learning, many advanced models, including convolutional neural networks [11,12], recurrent neural networks [13,14], graph neural networks [15–18] and Transformer [19], are used for multiple tasks in drug discovery. The optimization of these neural networks relies on the large-scale samples that are labeled with molecular information, such as biochemical properties, drug targets and side effects. However, labeled data are very rare in real drug discovery scenarios, limiting the performance and ability of deep learning [20]. Fortunately, self-supervised pre-training technologies have shown promise in reducing the negative effects of sparse data, thus bringing new opportunities for the further development of drug discovery.
The key process of new drug development for a given target. Notably, this work aims to summarize multimodal pre-training models of molecular representation for drug discovery. However, pre-clinical studies, clinical trials and regulatory approval focus on the cellular or patient response rather than the molecular level. Therefore, this paper summarizes the tasks of drug discovery, including lead discovery and optimization.
Pre-training technologies have achieved remarkable success in the fields of natural language processing and computer vision. In particular, large language models have shown great capabilities in some applications [21,22]. In this context, pre-training technologies have also received increasing attention in drug discovery [23–25]. For example, UniCure [26] proposed a pre-training model for cancer therapy response prediction that helps to promote the development of drug discovery and personalized precision oncology. The key advantage of pre-training technologies is that models can learn potential patterns behind molecules via self-supervised tasks, in which supervision signals are automatically generated from unlabeled data [20,27]. Initially, inspired by BERT [28], the masked language model—a popular pre-training method—was applied to molecular sequences, such as the simplified molecular-input line-entry system (SMILES) [29]. The hypothesis behind this line of work is that molecular sequences can be treated as a type of scientific language [30,31]. Therefore, molecular sequences can directly interact with the masked language model to improve the performance of drug discovery. To capture higher-order structure and semantic features, there is increasing interest exploring pre-training strategies on two-dimensional (2D) molecular graphs, 3D geometries and molecular interaction networks [32–34]. At the same time, molecule captions, which refer to brief biomedical text descriptions, also provide a bridge between natural language processing and drug discovery. Therefore, numerous studies, such as MolT5 [35], and BioT5 [36], have designed pre-training models on a corpus composed of biomedical texts and molecule sequences for drug discovery. More recently, pre-training algorithms have moved from unimodal paradigms to multimodal paradigms. Research suggests that multimodal pre-training tasks are conducive to encouraging deep models to systematically understand the features of molecules, thus achieving the competitive results in drug discovery [34,37,38].
Multimodal pre-training models have emerged as a new paradigm in drug discovery and have shown great prospect. The experience and ideas within these previous works can provide the important reference points for the further development of drug discovery. Therefore, this paper presents a systematic review that summarizes the foundations and progress in the field of multimodal pre-training model-based drug discovery, in order to provide the reference points for future research. We summarize the foundations of molecular modalities and revisit the network frameworks, self-supervised tasks, training strategies and their applications in drug discovery. In each section, we emphasize the differences and relations between various modalities or methods. Simultaneously, we focus on the adaptability between various modalities and pre-training methods. Previous works suggest two increasing trends.
(1) There are increasing attempts to explore the combination of Transformer and graph neural networks to encode cross-scale molecular representations. Subsequently, these hybrid network frameworks are pre-trained by leveraging multiple self-supervised tasks, thus promoting the performance of drug discovery.
(2) Molecule captions, as brief biomedical texts, provide a bridge for collaboration between drug discovery and large language models.
Finally, we discuss the remaining challenges and explore future opportunities, including unified frameworks, discrimination of cross-modal consistency and complementarity, and fusion of biochemical knowledge with foundation models. We hope that this review will inspire future research to develop more advanced techniques for drug discovery. The key references investigated in this review can be accessed at https://github.com/AISciLab/MultiPM4Drug.git.
MULTIMODAL PRE-TRAINING MODELS OF MOLECULAR REPRESENTATION
We aim to summarize multimodal pre-training models of molecular representation for drug discovery, in which multimodal pre-training models refer to self-supervised tasks guiding the deep-learning networks to learn a unified representation for drug discovery. Notably, pseudo-labels of self-supervised tasks are automatically generated based on specific attributes of the multimodal data [39]. In multimodal pre-training models of molecular representation, there are four key components: molecular data, neural networks, pre-training tasks and pre-training strategies, as shown in Fig. 2. Specifically, a large amount of molecular data provides the opportunity for multimodal pre-training models. Subsequently, neural networks, acting as encoders, are pre-trained to encode molecular representations via self-supervised tasks. In cross-modal contrastive learning and matching prediction, there often are multiple embedding networks to encode different modalities, which are co-trained with cross-modal objectives. On the other hand, multiple types of input data are converted into the specific sequences [37,40], and then share the same encoder in cross-modal masked and autoregressive prediction. Subsequently, the pre-trained encoders are applied to drug discovery through different training strategies. More importantly, previous works suggest two increasing trends that may be used as reference points for future research.
(1) Transformers and graph neural networks are often integrated together as encoders, and then cooperate with multiple pre-training tasks to learn cross-modal molecular representation.
(2) Molecule captions, as brief biomedical texts, provide a bridge for collaboration between drug discovery and large language models.
Finally, we discuss the challenges of multimodal pre-training models in drug discovery and explore future opportunities.
The key content and conclusion of this review.
MULTIMODAL MOLECULAR DATA
The growth of biomedical data provides vast opportunities for deep-learning-based drug discovery. The scale and quality of biomedical data are key factors for the successful application of deep learning in drug discovery. Therefore, a series of free and open-access databases with structured biomedical data has been developed to promote progress in drug discovery. In this work we aim to summarize multimodal pre-training models of molecular representation for drug discovery. Accordingly, we summarize the popular databases that focus on collecting the molecular information and provide a brief description in Section S1. In these databases, as shown in Fig. 3, molecules are encoded in different modalities, including molecular descriptors, 1D molecular sequences, 2D molecular graphs, 3D molecular structures, molecular interaction networks and textual captions (for a detailed description, see Section S1). These different modalities reflect the molecular features from different perspectives. The form of molecular modalities determines the framework and performance of deep learning for drug discovery.
A diagram illustrating the multimodal representation of benzoic acid. Here, we summarize only part of the databases due to space limitations. More databases and their descriptions can been found in Section S1.
For a given molecule, 1D sequences, 2D graphs, 3D geometries and molecular fingerprints describe its structure features at different scales. Compared to 1D sequences, 2D molecular graphs and 3D molecular geometries can capture the associations and spatial relationships between atoms, respectively. However, in most databases, molecules are usually represented via 1D sequences to reduce storage space. Different structure representations can be converted into one another using biochemical tools such as RDKit [41], and Open Babel [42]. Molecular descriptors reflect the physical and chemical properties of molecules. Molecular interaction networks represent the regulatory relationships among different biological entities in living systems. It is intuitive to hold that molecular interaction networks can simulate the natural law of ‘multiple drugs, multiple targets and multiple diseases’. In addition, drug interaction networks can integrate more information and are more suitable for the development of multi-omics fusion [43], compared with representations of other modalities. Compared with the representations of other modalities, molecule captions are more concise and intuitive. Unlike molecular structure and interaction networks, which require the reader to have some level of chemical knowledge, a molecule caption is often a concise overview of key characteristics. It is natural to regard a molecule caption as a textual description. Therefore, molecule captions are more suitable for large language models. However, for a given molecule, various organizations often employ different types of language for molecule captions, resulting in poor standardisation and uniformity. Based on the momentum of large language models, there is growing interest in jointly modeling molecular representations and natural language. In particular, multimodal fusion is beneficial for systematically understanding molecules and improves the robustness of deep learning [44,45].
NEURAL NETWORK FRAMEWORKS
Multimodal representations of molecules are used to drive different neural networks for drug discovery, as shown in Table 1. These networks aim to learn complex relationships and high-quality representations from large volumes of data. The choice of network frameworks has a substantial impact on the predictive performance of drug discovery. Generally, five classical network architectures—deep neural networks (DNNs) [46], convolutional neural networks (CNNs) [11,12], recurrent neural networks (RNNs) [13,14], graph neural networks (GNNs) [15–17] and Transformer [19]—are widely applied to drug discovery.
These neural network frameworks show different levels of compatibility and adaptability for molecular representations with different modalities. The core modules of DNNs are fully connected layers with nonlinear functions (see Section S2). DNNs, as the basis of deep learning, have simple network architectures. However, DNNs ignore the interdependence among elements in input data. Therefore, DNNs are more suitable for structured vectors, such as molecular fingerprints and descriptors, in which each element denotes an independent feature value. By contrast, CNNs can capture the local structure in grid-like data, such as pixel distributions in images. In addition, CNNs reduce the size of parameters because the weights are shared in the convolutional kernels (see Section S3). Therefore, several studies hold that 2D molecular graphs can be treated as images and then fed into 2D CNNs for drug discovery [47–49]. On the other hand, numerous studies proposed 1D CNNs for sequential data by extending the principles of traditional convolutional networks [50]. In contrast to CNNs, RNNs are specifically designed for sequence data with temporal dependencies, such as textual descriptions, SMILES and audio. RNNs aim to learn the temporal dependencies in sequence data through recurrent units (see Section S4). Unfortunately, as sequence length increases, RNNs may suffer from vanishing or exploding gradient problems. To address this issue, Transformers employ a multi-head self-attention mechanism to calculate the relationships between different positions in sequential data, thus capturing long- and global-range dependencies. Simultaneously, Transformers integrate positional encodings to capture the order of each element in sequences.
Intuitively, DNNs, CNNs, RNNs and Transformers are designed for Euclidean-structured data but are difficult to apply to graph-structured data, such as social networks, transportation networks and protein-protein interaction networks. Therefore, GNNs have been developed to learn graph representations via spectral graph convolutions or message-passing mechanisms that can capture the topology of graph data [51]. Simultaneously, various geometric GNNs [52,53] and heterogeneous GNNs [54,55] extend the message-passing mechanism to geometric graphs and heterogeneous networks, respectively (see Section S5). However, deeper GNNs may suffer from over-smoothing issues, resulting in performance degradation [56]. Theoretically, different network frameworks have distinct characteristics. Inherently, these networks are modified or fused to improve performance in drug discovery.
With the success of Transformers and GNNs on sequence and graph data, they have attracted increasing attention from the field of drug discovery [57]. In particular, the Transformer has emerged as a leading technique in multimodal pre-training models for drug discovery. The Transformer adopts an encoder-decoder framework as shown in Fig. 4a. The encoder maps sequential data into low-dimensional representations, while the decoder maps these representation vectors to specific outputs. A more detailed description of the Transformer can be found in Section S6.
The architectures of different Transformer networks: (a) the original Transformer, (b) the original vision Transformer, (c) the Graphormer.
Encoder. The encoder is composed of multiple modules with the identical structures. Each module contains a multi-head self-attention layer and a fully connected feed-forward network layer, which consists primarily of two linear layers with a ReLU non-linear function. Each sub-layer internally implements residual connections [58] and layer normalization [59] operations that can accelerate the convergence of the model. The self-attention mechanisms primarily learns the long-range dependencies among elements (e.g. tokens) in sequential data. [19] Within each module, multiple self-attention models are employed to increase stability.
Decoder. Similar to the encoder, the decoder is also composed of a stack of multiple identical modules, which mainly consist of multi-head attention layers and feed-forward network layers. In addition, the decoder introduces a masked-attention layer to prevent current positions from attending to subsequent positions. In other words, the masked attention ensures that the representation at each position depends only on information from the left of the current position. This is mainly because the output at each position only needs to consider the influence of known data to its left. Finally, the outputs are completed by a learnable linear transformation and Softmax function.
The core component of the Transformer is multi-head self-attention mechanisms, which aims to learn the long-range dependencies among elements (e.g. tokens) in sequential data. The Transformer has achieved a major breakthrough in drug discovery. Generally, SMILES sequences are treated as a special chemical language and then fed into the Transformer encoder to learn representations for drug discovery [60,61]. In drug-target interaction prediction and target-aware molecule generation, amino acid sequences are also treated as a specialized language. Therefore, the Transformer is also used to encode representations of proteins, thereby enhancing the performance of drug discovery [62,63].
Recently, variants of the Transformer have been gradually applied to various fields (see Section S7). In particular, the vision Transformer (ViT) has become a core architecture in the computer vision field. As shown in Fig. 4b, ViT retains the core principles of the original Transformer. In ViT, input images are divided into fixed-size patches as tokens, which are then linearly embedded into a sequence of vectors. Similar to ViT, the Transformer has also been extended to graph-structure data [64]. In particular, a graph Transformer, named Graphormer [65], has achieved competitive performance compared with GNNs. To leverage the topology of graphs, Graphormer incorporates centrality encoding, spatial encoding and edge encoding into Transformer, as shown in Fig. 4c. In light of Graphormer, the graph Transformer has emerged as a promising paradigm for graph data. Its key advantage lies in its ability to capture the relevance among nodes and enhance the topology in graphs, including node centrality, spatial distance and edge features. Vision and graph Transformers have been widely applied to drug discovery. For example, to capture 2D and 3D structures, some works, such as MAT [66], Transformer-M [67], MOLEBLEND [68] and Interformer [69], employed graph Transformers for drug discovery. TREE [64] proposed a Transformer-powered graph representation learning for cancer gene prediction, facilitating target identification in anti-caner drug discovery. In addition, in ISMol [70], 2D molecular graphs are treated as chemical-structure images and fed into a visual Transformer for property prediction.
MULTIMODAL PRE-TRAINING TASKS
Pre-training task are among the most core components in self-supervised learning. Unimodal pre-training tasks (see Section S8), aim to guide models to learn the inherent representations and knowledge within the input data. Unlike unimodal tasks, multimodal pre-training objectives must account for alignment and fusion among different modalities. We summarize the difference between unimodal and multimodal pre-training tasks in Section S9. Here, we summarize four types of multimodal pre-training tasks in the field of drug discovery: contrastive learning [71,72], multimodal matching prediction [39,73], masked prediction [28,31] and autoregressive prediction [21,74]. The different pre-training tasks extract different features or attributes from the data, as shown in Table 2.
For a given entity, multimodal contrastive learning and matching prediction focus on capturing the consistency between various modalities, thus losing the specificity of each modality (see Sections S10– S11). The underlying principle of contrastive learning is to maximize agreement between positive samples and minimize agreement between negative samples [75,76]. Similarly, multi-modal matching prediction aims to learn high-quality representations by predicting whether a pair of samples from two modalities are matched (positive pair) or not matched (negative pair) [39]. Generally, in matching prediction and contrastive learning, instances of the same objects in different modalities are treated as positive pairs, and samples of different objects are treated as negative pairs. The key difference between matching prediction and contrastive learning is that the former maximizes the mutual information of positive pairs and minimize that of negative pairs, whereas the latter categorizes all samples as either positive or negative. Their performance is closely related to the number and quality of negative samples, respectively. For matching prediction, these positive and negative pairs are often easy to identify, which can limit its performance. In contrast, contrastive learning can avoid model collapse by using large-scale batches to generate sufficient negative samples. However, contrastive learning tasks fail to capture interaction or correspondences between modalities, whereas matching prediction tasks capture the cross-modal interactions by applying attention mechanisms or multilayer perceptrons after modality-specific encoders. Therefore, some studies, such as ISMol [70] and MolCA [77], have combined contrastive learning and matching prediction to achieve the complementary effect in drug discovery.
Masked prediction, initially developed for natural language processing by BERT [28], has emerged as a leading pre-training paradigm. The unimodal masked prediction task can improve a model’s ability to understand the internal features of input data. Recently, masked prediction tasks have been applied to multimodal molecular representation learning, where models predict masked information based on prompts from other modalities [70]. In cross-modal masked prediction, multiple types of input data are converted into unified sequences and then fed into an encoder [36,37]. Multimodal masked prediction ensures that models can understand the complex relationships and interactions between different modalities, further promoting the alignment of multimodal data. For single modality space, masked prediction focuses on the internal information within input data. However, multimodal masked prediction enables a deeper understanding of the fine-grained interaction among modalities, enhancing the representation ability.
Similar to masked prediction, autoregressive pre-training tasks [21,74] were initially proposed in natural language models and have been applied to SMILES sequence-based drug discovery (Section S13). In multimodal autoregressive prediction, data from different modalities are integrated into a unified sequence. Subsequently, autoregressive pre-training tasks predict the next tokens based on the preceding tokens in the unified sequences, preventing positions from attending to subsequent positions. Therefore, autoregressive pre-training tasks enhance the generative ability of models, but can impair their representation and understanding capabilities. Multiple studies have subsequently developed unified autoregressive models of text and molecules using wrapped sequences [40,78]. Similar to masked prediction, cross-modal autoregressive tasks ignore the consistency of each modality.
In the field of drug discovery, contrastive learning, matching prediction and masked prediction are more widely adopted than autoregressive tasks. This is mainly attributable to two reasons. First, autoregressive tasks utilize only unidirectional information, making them unable to capture the complete characteristics of molecules. Second, autoregressive tasks can enhance the generative ability of models. However, there are fewer generative tasks in the field of drug discovery. To further improve the performance of pre-training models, there have been increasing attempts to explore combinations of multiple pre-training tasks, leveraging their complementary strengths. This can be treated as a multi-task learning problem, in which a hybrid objective is often formulated as the weighted sum of multiple pre-training tasks. These pre-training tasks may be beneficial to different downstream tasks. Therefore, multi-task pre-training methods can lead to more flexible and general-purpose representations. However, combining multiple pre-training tasks increases hyperparameter complexity due to differing convergence rates and sample scales of the various tasks.
SELF-SUPERVISED TRAINING STRATEGIES
The training strategies serve as a crucial bridge between the pre-training and fine-tuning stages. In Table 3, we summarize the key mechanisms of different training strategies. According to the relationships between pre-training and downstream tasks shown in Fig. 5(e–f), training strategies can be divided into three categories: joint training, unsupervised representation learning and two-stage training (pre-training and fine-tuning) [24,76].
Multimodal self-supervised tasks (a–d) and training strategies (e–g).
Joint training. In this scheme, the encoder is trained by combining self-supervised tasks and downstream tasks (see Section S14) [79,80]. Generally, the pre-training tasks serve as a regularization or auxiliary term for downstream tasks [24]. Intuitively, how to reasonably balance the loss values of pre-training and downstream tasks is a key factor in joint training. Joint training can enhance the adaptability of pre-training models to current downstream tasks and mitigate catastrophic forgetting. However, it reduces the generality and transferability of pre-training models.
Unsupervised representation learning. In this strategy, the encoder and self-supervised predictor are trained using pre-training tasks. The encoder is then frozen and added to the front of a new predictor for downstream tasks [38,81,82]. Therefore, the pre-training phase can be regarded as the feature-extraction process. In unsupervised representation learning, all downstream tasks share a pre-trained model, reducing computation while increasing flexibility and generality (see Section S15). Importantly, training only on prediction heads helps prevent overfitting when downstream tasks have small-scale training samples. However, freezing all pre-trained parameters prevents the model from capturing downstream task-specific features, although it mitigates catastrophic forgetting.
Two-stage training. Similar to unsupervised representation learning, encoders are trained by self-supervised tasks, thereby obtaining an encoder with high-quality parameters (see Section S16). Subsequently, the pre-trained encoder serves as the initial model in the fine-tuning stage and is fine-tuned together with the prediction head for downstream tasks. In two-stage training, the pre-training process can be considered a parameter-initialization step for the encoders. The pre-trained models are optimized to meet the requirements of downstream tasks, resulting in higher adaptability for different downstream tasks. However, as the number of pre-training model parameters grows, full parameter fine-tuning methods incurs a tremendous computational cost. Therefore, parameter-efficient fine-tuning techniques, including additive, selective and reparameterized fine-tuning methods, are applied in drug discovery [83]. These parameter-efficient fine-tuning algorithms aim to reduce the number of fine-tuning parameters while achieving performance comparable to full fine-tuning methods.
In the field of drug discovery, two-stage training strategies have become a general technique, whereas reparameterized fine-tuning remains an emerging technique. A common disadvantage of three self-supervised training strategies is the potential for negative transfer when the goals of pre-training and downstream tasks are weakly correlated or in conflict [84]. To address this issue, several studies have explored prompt-learning-based fine-tuning strategies to reduce the gap between pre-training and downstream tasks [34,85].
DRUG DISCOVERY IN THE CONTEXT OF MULTIMODAL PRE-TRAINING
This work focuses on summarizing multimodal pre-training models for molecular representation in drug discovery. Accordingly, we summarized the tasks related to lead discovery and optimization, including molecular generation, molecular property prediction, drug-drug interaction prediction, drug-target interaction prediction and molecule captioning. As shown in Fig. 6, an interesting observation is that the combination of 2D/3D molecular structure-based graph neural networks and SMILES-based Transformers constitutes a relatively popular framework in drug discovery. Masked prediction and contrast learning have emerged as the leading paradigms for molecular property prediction, DDI prediction and DTI prediction. Autoregressive prediction tasks are mainly applied to molecule generation and molecule captioning. With the successful application of large pre-trained models in natural language processing and computer vision, biomedical text-based multimodal pre-training models are mainly applied to molecule generation, molecular property prediction and molecule captioning. Simultaneously, molecule captioning has attracted significant attention from experts in the field of computer science. Most studies suggest that integrating multiple types of multimodal pre-training tasks is important for drug discovery. In addition, parameter-efficient fine-tuning-based drug discovery remains an emerging field. Here, we introduce the application of multimodal pre-training in drug discovery. further details can be found in Section S17.
The application of multimodal pre-training in drug discovery. Note that this paper fails to summarize the development of pre-clinical studies, clinical trials and regulatory approval, as these mainly focus on cellular or patient responses.
Molecular generation
Molecular generation aims to design new drugs with specific properties that include high affinity, safety and activity, low toxicity and structural novelty. In recent years, molecular generation has benefited from advances in multimodal pre-training. In drug discovery, 1D sequences are often treated as a chemical language system. It is therefore natural to develop GPT-like chemical language models for molecule generation, in which the Transformer decoder is pre-trained by autoregressively predicting the next SMILES tokens and fine-tuned using multimodal adapter mechanisms for target-aware molecular generation [63]. Along this line, atomic 3D coordinates are also treated as linguistic expressions and combined with SMILES to develop GPT-like autoregressive models [86]. In the fine-tuning stage, the pre-trained models often synergize with reinforcement learning to optimize the biophysical and chemical properties of the generated molecules. To further capture high-level structural information, Lingo3DMol [87] proposed an autoregressive pre-training approach that reconstructs the perturbed molecule back to its original state in both 2D and 3D representations. In the fine-tuning stage, the three encoder layers were fixed. The 2D decoder generates sequence fragments and local coordinates, while the other decoders generate 3D coordinates.
In addition, with the success of large pre-trained models in natural language processing, biomedical text-based molecular generation has also achieved significant progress. Based on a corpus consisting of SMILES sequences and textual descriptions, MolT5 [35] employed the recovery of masked spans—an extension of masked language models—to pre-train Transformer-based T5 [88]. Similarly, an increasing number of studies have begun to integrate textual descriptions and SMILES to pre-train Transformers [89–92]. These pre-training models are fine-tuned for molecule captioning and generation. To enable the seamless combination of molecular sequence and 3D structures, the fine-grained 3D structure-aware fingerprint is mapped into a specialized 3D token. Next, these specialized 3D tokens, molecular sequences and textual descriptions are fed into a T5-based unified architecture, pre-trained using masked-span recovery and cross-modal translation tasks. To fuse multimodal data, another studies have used multiple networks to encode SMILES, 2D molecule graphs, molecular images and captions, respectively. The multimodal neural networks are pre-trained using cross-modal contrastive learning and matching prediction, and prompt-tuned for molecule captioning and generation [93].
Molecular property prediction
Molecular property prediction, a fundamental task in drug discovery, aims to infer the molecular attributes based on structure features. Multimodal pre-training techniques have been widely applied to molecular property prediction. There are two similar points among these methods. Firstly, most studies leveraged dual encoders that include SMILES-sequence-driven Transformer [60] and a 2D/3D-graph-structure-driven graph neural network [32,94]. Secondly, cross-modal contrastive learning is treated as a key pre-training task. In the fine-tuned stage, the task-specific prediction layer was attached to the pre-trained model for molecular property prediction. Inherently, contrastive-learning-based methods focus mainly on the consistency between different modalities, while ignoring the internal characteristics within each modality. To address this problem, more and more studies integrated multiple pre-training tasks, including contrastive learning, masked prediction and match prediction to improve the performance of property prediction [61,70,95]. In addition, the combination of more modalities, novel neural networks and domain knowledge is becoming an upward-trending research topic in molecular property prediction [33,34,68]. On the other hand, molecular-structure- and text-based pre-training models are applied to molecular property prediction. There are two types of popular paradigms. One is a deep-learning system that bridges molecular structures and biomedical texts [37]. To be specific, the molecular entities in biomedical texts are replaced with the segmented SMILES. Masked language models based on the mixed sequences encourage Transformers to capture the meta-knowledge among different semantic units, and are then fine-tuned for drug discovery, including molecular property prediction, named entity recognition and relation extraction. Other approaches often adopt multi-channel frameworks that integrate biomedical texts with SMILES and 2D and 3D molecular representations, and are pre-trained by contrastive learning [96–98]. Finally, the pre-trained models are fine-tuned for molecular property prediction.
Drug-drug interaction prediction
Drugs may interact with each other when using drug combinations to treat diseases. Drug–drug interaction (DDI) increases the risk of adverse drug reactions. Therefore, it is significant to explore multimodal self-supervised learning-based DDI predictions. Similar to molecular property prediction, most multimodal methods employ contrastive learning frameworks for DDI predictions. However, most of these studies focus on using contrastive learning to align molecular structure and molecular interaction networks [38,79]. Simultaneously, advanced heterogeneous graph neural networks have been developed to capture multiple relationships within molecular interaction networks, improving the performance of DDI prediction [80,99]. Furthermore, a few works have synthesized 2D graphs and semantic information in biomedical interaction networks via attribute-masking and link-prediction tasks [100].
Drug-target interaction prediction
Drug-target interaction (DTI) prediction aims to determine whether a given drug and target can interact with each other. Generally, DTI prediction is also treated as a drug-repositioning strategy that aims to screen potential drugs from existing ones for given targets [101]. Multimodal-based molecular pre-training is a potential technique in the field of DTI prediction. Based on SMILES and 2D molecular graphs, a growing body of work has employed masked prediction and contrastive learning to train Transformers and graph convolutional networks for drug–target interaction prediction [102]. Several works have learned features from molecular graphs and interaction networks via cross-modal graph-contrastive-learning approaches for drug–target affinity prediction, which is a similar task to DTI prediction [103]. On the other hand, BioT5 [36] integrates molecular SELFIES [104], protein sequences, general text wrapped text [40], in which molecule names were replaced with their corresponding SELFIES and gene names were appended with related protein sequences. To reduce the gap between pre-training and downstream tasks, BioT5 adopted prompt-based fine-tuning for drug-target interaction prediction and molecular property prediction.
Molecule captioning
Given a molecular structure, molecule captioning aims to provide comprehensive text descriptions, enhancing the understanding of key characteristics that include the molecule name, chemical formula, basic structure properties and functions [105]. In comparison with other tasks in drug discovery, molecule captioning is still an emerging field. With the successful application of pre-training large models in natural language and computer vision, there is increasing interest in deep-learning-based molecule captioning. We find that molecule captioning and text-based molecule generation are often treated as bi-directional translation between molecules and language. The molecule-captioning task is similar to image-caption generation [106]. Most methods of molecule captioning refer to language-image pre-training models. A wide range of approaches employed multiple self-supervised tasks to train encoders that integrated GNNs and text-based pre-training models to learn representation from 2D molecular graphs and textual descriptions [107,108]. These multimodal pre-trained models are fully or parameter-efficiently fine-tuned for molecule captioning, IUPAC-name prediction and molecule-text retrieval. Following the line of BLIP-2 [106], recent works have utilized the Q-Former to bridge the gap between molecules and texts [109,110]. The Q-Former enables large language models to interpret and analyze molecules. These studies jointly pre-train the Q-Former together with the frozen encoder via multi-objective training.
CONCLUSION AND OUTLOOK
In this paper, we present a comprehensiveness review of multimodal pre-training models for drug discovery, covering the spectrum from databases and molecular representations, through pre-training frameworks, tasks and strategies, to diverse applications. This review aims to discuss development of multimodal pre-training techniques in drug discovery and summarize novel perspectives. In the first place, we summarize the differences and relevance between various molecule representations. Then, we discuss different neural networks, and find that different modal representations are suitable for different network frameworks. In particular, molecular-sequence-based Transformers and 2D/3D-molecular-structure-based graph neural networks have emerged as the leading frameworks for drug discovery. Subsequently, we explore self-supervised tasks and pre-training strategies, highlighting the goals of different tasks and the trends of multi-modal and multi-task collaboration. Finally, we find that biomedical text provides a bridge for the integration of drug discovery with foundation models. Despite the significant progress, there are still challenges. In this section, we discuss the existing challenges and the future opportunities.
Designing unified and flexible multi-modal frameworks. In drug discovery, most multimodal approaches utilize Transformers and graph neural networks to encode molecular sequences and 2D/3D molecule structures, and then align different modalities via contrast learning or match prediction. Unfortunately, these coarse-grained alignment strategies miss the deep comprehension of items within different modalities [68]. On the other hand, graph neural networks often suffer from the over-smoothing phenomenon. In light of this, it is necessary to design unified pre-training frameworks that can fully encode features of each modality and ensure fine-grained alignment of various modalities. We may be able to refer to previous works, such as 3D-MolT5 [111], VLMO [112], SAPROT [113] and BEiT-3 [114], in which different modalities are converted into sequences and then fed into Transformers to learn representations. In addition, in practical scenarios, there is a high likelihood that some samples are missing certain modalities. Therefore, it is necessary to develop flexible frameworks to automatically handle cases with missing modalities.
Discriminating consistency and complementarity among various modalities. It is intuitive that there is not only complementarity but also consistency among multimodal representations of given molecules. Complementarity implies that each modality essentially has its specific or distinctive characteristics. In contrast, consistency suggests that there are shared attributes between different modality representations. Theoretically, multi-task pre-training approaches can capture cross-modal consistency and complementarity by integrating masked prediction, contrastive learning and matching prediction [33,70]. However, the major limitation of general multi-task frameworks is that the shared space may contain unnecessary specific features, and vice versa. In other words, general multi-task frameworks increase redundant information and computational complexity to a certain extent. Therefore, based on multi-task learning frameworks, determining how to discriminate and fuse consistency and complementarity among various modalities is crucial for improving the performance and efficiency of drug discovery.
Incorporating more modalities and domain knowledge. With the rapid development of life sciences, advanced techniques, such as molecular dynamics, single-cell sequencing techniques and high-throughput microscopies, generate more biomedical data with different modalities, including molecular videos [115], single-cell RNA sequencing data [116,117] and cell images [118], further promoting the application of multimodal pre-training models in the field of drug discovery. In addition, multimodal pre-training techniques can learn a variety of hidden information from these data. However, these deep-learning models find it difficult to automatically extract biochemical knowledge, which plays a key role in drug discovery. With the rapid development of life sciences and pharmacology, there is a wealth of biochemical knowledge, such as functional groups [119], coarse-grained model representations [120], motifs [121] and synthetic accessibility [122], that is closely related to molecular representations. The integration of pre-training models with such biochemical knowledge represents a significant opportunity to further advance drug discovery. Simultaneously, domain knowledge can enhance the interpretability of multimodal pre-training algorithms.
Integrating general-purpose foundation models. In the field of natural language processing, foundation models have made significant progress and have promoted the development of biomedical text mining. Being aware of this issue, a few studies, such as MolT5 [35] and BioT5 [36], incorporated textual descriptions and molecule SMILES to improve the performance of drug discovery. Unfortunately, these models only integrated 1D structure features, limiting the performance of drug discovery. On the other hand, compared to these specialized pre-trained models, general foundation models can learn the complex logic and context in long documents, resulting in a deeper understanding of the knowledge behind long texts. In addition, the corpus of general foundation models is more abundant and includes texts, images, audio and videos, providing interdisciplinary knowledge. Therefore, the fusion of general large language models with multimodal pre-training models opens new avenues for drug discovery and remains an open topic.
Supplementary Material
nwaf495_Supplemental_File
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Zhang P, Wang X, Cen X et al. A deep learning framework for in silico screening of anticancer drugs at the single-cell level. Natl Sci Rev 2025; 12: nwae 451.10.1093/nsr/nwae 45139872221 PMC 11771446 · doi ↗ · pubmed ↗
- 2Zhang K, Yang X, Wang Y et al. Artificial intelligence in drug development. Nat Med 2025; 31: 45–59.10.1038/s 41591-024-03434-439833407 · doi ↗ · pubmed ↗
- 3Harrer S, Shah P, Antony B et al. Artificial intelligence for clinical trial design. Trends Pharmacol Sci 2019; 40: 577–91.10.1016/j.tips.2019.05.00531326235 · doi ↗ · pubmed ↗
- 4Nosengo N. New tricks for old drugs: faced with skyrocketing costs for developing new drugs, researchers are looking at ways to repurpose older ones–and even some that failed in initial trials. Nature 2016; 534: 314–7.10.1038/534314 a 27306171 · doi ↗ · pubmed ↗
- 5Liu T, Lu D, Zhang H et al. Applying high-performance computing in drug discovery and molecular simulation. Natl Sci Rev 2016; 3: 49–63.10.1093/nsr/nww 00332288960 PMC 7107815 · doi ↗ · pubmed ↗
- 6Tang X, Lei X, Liu L. A multi-modal drug target affinity prediction based on graph features and pre-trained sequence embeddings. Interdiscip Sci Comput Life Sci 2025; 17: 822–43.10.1007/s 12539-025-00713-740455402 · doi ↗ · pubmed ↗
- 7Jumper J, Evans R, Pritzel A et al. Highly accurate protein structure prediction with alphafold. Nature 2021; 596: 583–9.10.1038/s 41586-021-03819-234265844 PMC 8371605 · doi ↗ · pubmed ↗
- 8Zhou X, Hu J, Zhang C et al. Assembling multidomain protein structures through analogous global structural alignments. Proc Natl Acad Sci USA 2019; 116: 15930–8.10.1073/pnas.190506811631341084 PMC 6689945 · doi ↗ · pubmed ↗
