MMPCS: multi-view molecular pretraining based on consistency information and specific information

Chenyang Xie; Yingying Song; Song He; Xiaochen Bo; Zhongnan Zhang

PMC · DOI:10.1093/bioinformatics/btag028·January 14, 2026

MMPCS: multi-view molecular pretraining based on consistency information and specific information

Chenyang Xie, Yingying Song, Song He, Xiaochen Bo, Zhongnan Zhang

PDF

Open Access

TL;DR

This paper introduces MMPCS, a new method for molecular representation learning that improves predictions by combining shared and view-specific information from multiple molecular views.

Contribution

MMPCS explicitly factorizes molecular representations into consistency and specific components, achieving state-of-the-art performance on multiple molecular tasks.

Findings

01

MMPCS outperformed 16 existing methods in molecular property prediction tasks.

02

It showed strong performance in predicting drug-target binding affinity and cancer drug response.

03

The method was effective in a case study for SARS-CoV-2 drug repurposing.

Abstract

The goal of molecular representation learning is to automate the extraction of molecular features, a critical task in cheminformatics and drug discovery. While pretraining models using multiple views like SMILES, 2D graphs, and 3D conformations have advanced the field, integrating them effectively to produce superior representations remains a challenge. To bridge this gap, we propose a novel multi-view molecular pretraining method termed MMPCS, which explicitly factorizes representations into consistency and specific information. Our approach utilizes the Graph Isomorphism Network and the RoBERTa model to encode 2D molecular topological graphs and SMILES sequences, respectively. Each resulting molecular embedding is decomposed into a shared consistency component and a view-specific remainder. An autoencoder then aligns the consistency information across views. The combined consistency…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Genes1

TXK

Proteins1

Species3

Severe acute respiratory syndrome coronavirus 2(no rank)Mus musculus(house mouse · species)Homo sapiens(human · species)

Chemicals7

Abemaciclib Lp halogens Brigatinib Copanlisib MMPCS ClinTox

Diseases10

cancer SARS-CoV-2 Malaria PCC COVID-19 SCC lung inflammation CEP infection toxicity

Mutations1

V600E

Figures5

Click any figure to enlarge with its caption.

Flowchart of the MMPCS framework, illustrating its pretraining and fine-tuning stages. (a) During pretraining, MMPCS uses covariance and alignment loss to extract information from two distinct molecular views. (b) In fine-tuning, MMPCS leverages both SMILES and molecular graphs to enhance performance on downstream tasks.

Sankey diagram visualizes the predicted drug–target affinities between four SARS-CoV-2 Omicron variant proteins (top) and the top 10 FDA-approved drugs (bottom). The width of each ribbon represents the predicted affinity strength, where a thicker ribbon denotes a stronger affinity.

Performance trends of MMPCS under varying loss weight β (with α fixed at 1.0) for (a) classification and (b) regression tasks.

Effect of GIN network depth on model performance across (a) classification and (b) regression datasets.

t-SNE visualization of molecular representations fine-tuned on the Estrogen dataset, which contains two binary classification tasks (α and β). The first row (a–c) corresponds to the α task, and the second row (d–f) to the β task. (a) and (d): SMILES embedding concatenated with the view-specific information of the molecular graph (Se∥geu); (b) and (e): view-specific information of the SMILES sequence concatenated with the embedding of the molecular graph (Seu∥ge); (c) and (f): direct concatenation of the SMILES and molecular graph embeddings (Se∥ge).

Tables10

Table 1. Pretraining dataset information.

Property	Value
Molecules	500 000
Avg. atoms	27.4
Avg. bonds	29.8
Avg. degree	2.17
Avg. molecular weight	391.6

Table 2. Classification task datasets information.

Dataset	Compounds	Tasks	Split	Labeled data ratio
BBBP	2050	1	Scaffold	100%
Tox21	7831	12	Random	82.94%
ToxCast	8597	617	Random	27.92%
SIDER	1427	27	Random	100%
ClinTox	1484	2	Random	100%
BACE	1513	1	Scaffold	100%
Estrogen	3122	2	Random	63.57%
MetStab	2267	2	Random	100%

Table 3. Regression task datasets information.

Dataset	Compounds	Split
ESOL	1128	Random
Lipo	4200	Random
FreeSolv	642	Random
Malaria	9999	Random
CEP	29978	Random

Table 4. Hyperparameters of the MMPCS model.

Hyperparameters	Setting
Pretrain epoch	1
Pretrain batch size	32
Learning rate	0.00005
Dim of final SMILES/graph embedding	256
Num of GIN conv layer	3
Coefficient of loss $α, β$	[1, 1]
Fine-tuning epoch	100
Fine-tuning batch size	64

Table 5. Classification results (ROC-AUC, %) on molecular property prediction tasks.a

Category	Dataset	BBBP	Tox21	ToxCast	SIDER	ClinTox	BACE	Estrogen	MetStab	Avg
Sequence-Based	SMILES-Transformer	52.50	62.66	77.11	79.89	92.59	62.48	57.59	74.84	69.95
Graph-Based	GraphCL	70.05	58.17	50.58	54.83	95.25	78.1	85.08	77.57	71.20
	JOAO	68.99	50	50.26	53.47	92.25	77.9	88.76	83.46	70.63
	JOAOv2	71.76	55.75	50.76	53.74	92.9	72.03	91.08	77.43	70.68
	AD-GCL	69.98	50.52	51.33	52.93	90.19	75.98	76.98	74.91	67.85
	GraphLoG	68.95	62.55	50.51	48.69	95.97	76.09	75.21	72.82	68.84
	MGSSL	71.41	60.38	49.19	57.36	92.38	81.91	80.16	88.67	72.68
	Lp-Info	66.61	55.06	50.61	51.17	95.79	73.32	75.49	75.66	67.96
	SimGrace	63.14	52.4	50.02	53.26	92.68	74.3	83.1	79.33	68.52
	GraphMAE	68.8	55.08	51.27	52.75	95.33	76.66	85.97	81.21	70.88
	MolCLR	71.66	79.45	72.8	59.62	49.94	78.00	94.16	79.69	73.16
	Mole-BERT	69.26	57.25	49.59	50.18	95.49	75.51	90.95	79.98	71.02
Multi-View	GraphMVP*	68.5	74.5	62.7	62.3	79.0	76.8	-	-	-
	3DInfomax*	69.1	74.46	64.41	53.37	59.43	79.42	-	-	-
	FragCL*	71.4	75.2	65.1	61	95.2	82.3	-	-	-
	MEMO*	71.6	76.7	64.9	61.2	81.6	82.6	-	-	-
	MMPCS (ours)	73.41	81.33	86.81	82.55	98.76	80.16	95.14	92.65	86.35

Table 6. Regression results (RMSE) on molecular property prediction tasks.a

Category	Dataset	ESOL	Lipo	FreeSolv	Malaria	CEP	Avg
Sequence-Based	SMILES-Transformer	2.095	1.121	3.790	1.176	2.539	2.144
Graph-Based	GraphCL	0.989	0.726	1.697	0.998	1.302	1.142
	JOAO	0.807	0.705	1.472	1.035	1.466	1.097
	JOAOv2	0.809	0.698	1.637	1.046	1.477	1.133
	AD-GCL	0.936	0.758	1.793	0.977	1.646	1.222
	GraphLoG	0.927	0.714	1.956	0.984	1.337	1.184
	MGSSL	0.766	0.725	1.595	0.956	1.292	1.067
	Lp-Info	1.191	0.709	1.885	0.977	1.320	1.216
	SimGRACE	1.365	0.838	1.861	0.997	1.363	1.285
	GraphMAE	1.007	0.701	1.676	0.986	1.289	1.132
	MolCLR	0.861	0.623	1.804	0.998	1.004	1.058
	Mole-BERT	0.838	0.679	1.866	0.982	1.304	1.134
Multi-View	GraphMVP*	1.091	0.718	-	1.114	1.236	-
	3DInfomax*	0.894	0.695	2.337	-	-	-
	MEMO*	0.984	0.707	-	1.093	1.101	-
	MMPCS (ours)	0.595	0.649	1.473	0.999	1.018	0.946

Table 7. Performance of various methods on the CDR task.a

Methods	PCC↑	SCC↑
MoLI	0.813	0.782
CDRscan	0.871	0.852
tCNNs	0.910	0.889
DeepCDR	0.923	0.898
DeepTTA	0.941	0.914
MMPCS	0.946	0.933

Table 8. Performance of various methods on the DTA task.a

Methods	CI↑	MSE↓
KronRLS	0.782	0.411
SimBoost	0.836	0.222
DeepDTA	0.865	0.194
WideDTA	0.875	0.179
GraphDTA	0.882	0.174
MMPCS	0.882	0.150

Table 9. Ablation study evaluating MMPCS under different variants on representative datasets.

Dataset	BBBP	SIDER	Estrogen	ClinTox	ESOL	FreeSolv

	ROC-AUC↑				RMSE↓
${MMPCS}_{w / o align}$	68.91	81.92	94.83	95.85	0.617	2.315
${MMPCS}_{w / o cov}$	67.23	82.53	94.97	98.72	0.802	1.860
${MMPCS}_{SMILES}$	60.08	80.68	62.23	91.30	0.721	4.009
${MMPCS}_{graph}$	65.84	82.41	94.11	96.48	0.928	1.788
${MMPCS}_{w / o pre}$	70.02	82.12	94.96	98.53	0.667	1.489
$MMPCS$	73.41 ^a	82.55	95.14	98.76	0.595	1.473

Table 10. Comparison of different decorrelation loss functions on representative datasets.

Dataset	BBBP	SIDER	Estrogen	ESOL	FreeSolv

	ROC-AUC↑			RMSE↓
${MMPCS}_{OL}$	70.59	82.21	94.52	0.648	1.961
${MMPCS}_{HSIC}$	68.40	82.17	94.73	0.612	1.939
${MMPCS}_{MINE}$	70.87	81.69	88.11	1.288	2.128
$MMPCS$	73.41 ^a	82.55	95.14	0.595	1.473

Equations36

Funding2

—National Key R&D Program of China10.13039/501100012166
—Fujian Sunshine Charity Foundation of China

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsComputational Drug Discovery Methods · Machine Learning in Materials Science · Advanced Graph Neural Networks

Full text

1 Introduction

Molecular representation learning focuses on the automatic extraction of molecular features and plays a critical role in various downstream tasks such as molecular property prediction (Yang et al. 2019) and molecular design (Du et al. 2022). Deep learning has been widely adopted in this field, significantly enhancing the efficiency of molecular representation learning, and driving advancements in cheminformatics and drug discovery. However, deep-learning models typically rely on large amounts of labeled data for effective training. In data-scarce scenarios, these models are prone to overfitting and may capture spurious correlations (Sagawa et al. 2020). To maximize the potential of deep learning in molecular representation learning, addressing the challenges posed by insufficient labeled data is essential.

In recent years, inspired by the remarkable success of leveraging large-scale unlabeled data in natural language processing and computer vision, pretraining has been increasingly applied to molecular representation learning (Xia et al. 2023a). These models are typically pretrained on objectives derived from extensive unlabeled datasets and subsequently fine-tuned on labeled downstream tasks.

Pretrained models in the molecular domain can be broadly categorized into three classes by input modalities. The first category relies on Simplified Molecular Input Line Entry System (SMILES) sequences (Weininger 1988). Early approaches, such as SMILES-Transformer (Honda et al. 2019), adopted autoencoder architectures (Hinton and Salakhutdinov 2006) that trained the decoder to reconstruct the original input sequence. Subsequent approaches, such as SMILES-BERT (Wang et al. 2019) and ChemBERTa (Ahmad et al. 2022), introduced masked language modeling by randomly masking parts of the SMILES sequences during training. Recent studies have further enhanced SMILES-based pretraining by incorporating strategies such as functional-group-level masking (Peng et al. 2025) and randomized SMILES augmentation (Lee et al. 2025). Despite these advancements, these models are inherently limited by their reliance on linear representations, which struggle to capture the intrinsic structural information of molecules.

Pretraining models based on 2D molecular topological graphs represent the second category of approaches. Hu et al. (Hu et al. 2020) introduced masked graph modeling by predicting masked node attributes or masked substructures. GraphMAE (Hou et al. 2022), on the other hand, reconstructed masked node features using a decoder. Beyond masking strategies, contrastive learning has been widely adopted. Methods such as GraphCL (You et al. 2020), JOAOv2 (You et al. 2021), AD-GCL (Suresh et al. 2021), and MolCLR (Wang et al. 2022) learned by maximizing the consistency between augmented graph views. Other studies have further extend molecular representation learning through motif-level self-supervision (Yan et al. 2024) or subgraph-conditioned bottlenecks for core-structure disentanglement (Hoang and Lee 2025). Common graph augmentations include node dropout, edge perturbation, feature masking, and subgraph extraction. However, even minor structural perturbations may distort molecular semantics. To mitigate this issue, MoCL (Sun et al. 2021) incorporated domain knowledge to guide substructure replacement, but this limits its generality and increases costs. Such methods that rely heavily on data augmentation have limitations in terms of efficiency and applicability. To avoid semantic distortion and reduce dependence on data augmentation, our model entirely eliminates the use of data augmentation.

The third category comprises pretrained models based upon multiple molecular views. Multi-view approaches have demonstrated significant advantages in bioinformatics for learning richer molecular representations. For instance, GraphMVP (Liu et al. 2022) jointly utilized 2D molecular graphs and computationally expensive 3D geometries. MM-Deacon (Guo et al. 2022) leveraged both SMILES and IUPAC sequences, while DVMP (Zhu et al. 2023) applied contrastive learning between SMILES and 2D graphs. MEMO (Zhu et al. 2022) integrated four distinct modalities: SMILES sequences, 2D graphs, 3D structures, and molecular fingerprints (Cereto-Massagué et al. 2015). By integrating 2D, 3D, and textual data, Uni-Mol2 (Wang et al. 2024) introduced a unified multi-view framework for molecular representation However, existing multi-view approaches predominantly emphasize cross-view consistency, often overlooking the view-specific information that not only complemented such consistency but also provides additional discriminative cues. Fully leveraging both consistency and view-specific information can yield more comprehensive and expressive molecular representations (Chen et al. 2023).

To address this limitation, we introduce MMPCS, a multi-view molecular pretraining model that explicitly factorizes representations into consistency and view-specific information. Our model leverages two prevalent molecular representations: 2D topological graphs and SMILES sequences. Building on these two views, we design an innovative dual-view pretraining strategy. Frist, each view is fed into a dedicated encoder to extract its corresponding feature representation. The resulting embeddings are then decomposed into an inter-view consistency information shared across views and a view-specific information unique to each view. To ensure a clean disentanglement, we minimize the covariance between these two components. Subsequently, the consistency information across views is aligned through a mutual prediction task. For downstream tasks, the final molecular representation is formed by concatenating the aligned consistency information with the original specific information from each view, thereby providing a comprehensive encoding for enhanced predictive performance.

The main contributions of this study can be summarized as follows:

We introduce a multi-view molecular pretraining model that incorporates not only the consistency information shared across different views but also the specific information unique to each view.MMPCS aligns consistency information across views without relying on data augmentation, thereby mitigating potential issues may arise from significant changes in molecular properties due to augmentation.MMPCS maintains high performance while requiring only a single copy of consistency information for downstream tasks, effectively reducing data dimensionality and improving computational efficiency.

To evaluate MMPCS, we began by pretraining the model on 500 000 randomly selected SMILES sequences from the ZINC dataset (Irwin and Shoichet 2005). Subsequently, MMPCS was fine-tuned on eight property classification datasets and five molecular property regression datasets for property prediction tasks. Compared with 16 classical and state-of-the-art methods, MMPCS achieved top or second-best performance on 11 molecular property prediction (MPP) datasets, and attained the highest average results across both the classification and regression tasks. Notably, MMPCS outperformed the most advanced multi-view pretraining models across all five regression datasets. Moreover, it achieved state-of-the-art performance in drug-target binding affinity (DTA) prediction and cancer drug response (CDR) tasks. Furthermore, visualization results show that MMPCS enhances the efficiency of molecular representations while preserving essential information. These findings collectively highlight the effectiveness and robustness of MMPCS in molecular representation learning.

2 Materials and methodology

2.1 Problem definition

Given a molecule $[eqn]$ , let $[eqn]$ and $[eqn]$ denote the SMILES sequence and 2D topological graph of the molecule, respectively. $[eqn]$ is denoted by a sequence $[eqn]$ , where $[eqn]$ is its length. $[eqn]$ is represented as a graph $[eqn]$ , where $[eqn]$ is the set of atoms constituting molecule $[eqn]$ , and $[eqn]$ is the set of chemical bonds between atoms. The loss of the pretraining model can be expressed by Equation (1):

[eqn]

where $[eqn]$ represents the parameters of the pretraining model $[eqn]$ , and $[eqn]$ denotes the self-supervised learning loss function, which is a non-negative value. Our goal is to learn the optimal model parameters $[eqn]$ for $[eqn]$ such that the loss is as close to zero as possible.

2.2 Datasets

The pretraining dataset comprises 500 000 randomly selected SMILES sequences from the ZINC (Table 1). For downstream tasks, we focused on MPP, CDR, DTA, and SARS-CoV-2 drug repurposing. To comprehensively evaluate the effectiveness of the proposed pretraining model, we used eight classification and five regression datasets, including six classification and three regression datasets from MoleculeNet (Wu et al. 2018). We further incorporated four additional datasets, Estrogen (Gaulton et al. 2012), MetStab (Podlewska and Kafel 2018), CEP (Hachmann et al. 2011), and Malaria (Gamo et al. 2010), to broaden the evaluation scope. The Estrogen and MetStab datasets offer biologically relevant information on pharmacological activity and metabolic stability, whereas CEP and Malaria extend the assessment to materials science and antimalarial screening, respectively. This diverse combination ensures MMPCS to be rigorously evaluated across a wide spectrum of chemical and biological properties. To ensure a rigorous and fair comparison with existing studies, we adopted different data splitting strategies. Following MoleculeNet’s recommendation, a scaffold-based split was applied to the BBBP (Martins et al. 2012) and BACE (Subramanian et al. 2016) datasets, which helps prevent structurally similar molecules from appearing across training and test sets, thereby offering a more challenging evaluation of model generalizability. For the remaining datasets, Tox21 (Huang et al. 2016), ToxCast (Richard et al. 2016), SIDER (Kuhn et al. 2016), ClinTox (Gayvert et al. 2016, Fromer and Coley 2023, Feng et al. 2024), Estrogen, MetStab, ESOL (Delaney 2004), Lipo, FreeSolv (Mobley and Guthrie 2014), Malaria, and CEP, we applied a conventional random split to maintain consistency with prior benchmark studies, as detailed in Tables 2 and 3. Together, these datasets cover key aspects of drug discovery, spanning pharmacology, physical chemistry, and physiology.

For the CDR task, we used the commonly used GDSC dataset (Chapman et al. 2011). For the DTA task, we specifically utilized the KIBA dataset (Tang et al. 2014), an established benchmark in this domain. We followed its recommended data split method for reliable results. The model fine-tuned on KIBA was then directly applied to predict SARS-CoV-2 drug repurposing.

For models that require 2D molecular topological graphs as input, SMILES sequences were converted into molecular graphs using RDKit (Landrum 2006), and the features of nodes and edges were initialized to obtain 2D molecular topological graphs. SMILES sequences were converted into IUPAC identifiers via the STOUT model (Rajan et al. 2021) where such inputs were required.

2.3 Evaluation metrics

The pretraining dataset comprises 500 000 randomly selected SMILES sequences from the ZINC (Table 1). For downstream tasks, we focused on. To evaluate the performance of the model, the Receiver Operating Characteristic—Area Under the Curve (ROC-AUC) is used to evaluate the classification task in the downstream MPP task. ROC-AUC ranges from 0 to 1, with higher values indicating better performance. The ROC curve, plotting TPR against FPR, allows ROC-AUC computation via Equation (2):

[eqn]

The Root Mean Square Error (RMSE) is used to evaluate the regression task in the MPP. A smaller RMSE indicates that the model predictions $[eqn]$ are closer to the true values $[eqn]$ . The RMSE is expressed by Equation (3):

[eqn]

For the CDR task, the evaluation metrics are Pearson’s correlation coefficient (PCC) and Spearman’s correlation coefficient (SCC). PCC quantifies the linear relationship between two variables, on a scale from −1 to 1, based on the true values $[eqn]$ , the predicted values $[eqn]$ , and their respective means $[eqn]$ and $[eqn]$ . PCC can be expressed as follows:

[eqn]

Unlike PCC, SCC measures the monotonic relationship between variables, does not assume linearity, and is robust to outliers. SCC can be expressed as follows:

[eqn]

where $[eqn]$ represents the difference in rank between the true and predicted values for sample $[eqn]$ .

For the DTA task, Concordance Index (CI) and MSE are used for evaluation. CI measures the proportion of correctly ranked outcome pairs and is calculated as follows

[eqn]

where $[eqn]$ is an indicator function that returns 1 if the condition is true and 0 otherwise. In addition, the quality of the molecular feature clusters is quantitatively assessed using the Davies–Bouldin Index (DBI), where a value closer to zero indicates better clustering performance. The formula for the DBI is as follows:

[eqn]

where $[eqn]$ represents intra-cluster average distance and $[eqn]$ refers to the inter-cluster center distance.

2.4 Method

This section provides a detailed description of the MMPCS. Figure 1 provides an overview of the model, which consists of two main stages: the pretraining stage and downstream task fine-tuning stage. In the pretraining stage, we first extract the molecular topology and SMILES sequence features. Subsequently, we introduce covariance loss to extract the consistency features and specific features of each view and align the consistency feature representations of the two views through consistency loss. In addition, to capture the holistic information of each view, we combine its specific information with the consistency information from the counterpart view to reconstruct the original features via an autoencoder. In the fine-tuning stage, to reduce the redundant information between views, we only concatenate the SMILES consistency representation with the view-specific representation from each view and pass the combined representation to the predictor for the final output.

Flowchart of the MMPCS framework, illustrating its pretraining and fine-tuning stages. (a) During pretraining, MMPCS uses covariance and alignment loss to extract information from two distinct molecular views. (b) In fine-tuning, MMPCS leverages both SMILES and molecular graphs to enhance performance on downstream tasks.

2.4.1 Molecular topological graph feature extraction

MMPCS uses a Graph Neural Network (GNN) for feature extraction from the molecular graph, $[eqn]$ . Following Hu et al. (Hu et al. 2020), the Graph Isomorphism Network (GIN) is adopted as the backbone network. For a $[eqn]$ -layer GIN, the molecular graph is initialized by incorporating atom-level and bond-level information. Specifically, the node features are initialized based on atom type $[eqn]$ and chirality $[eqn]$ , while the edge features are initialized from bond type $[eqn]$ and bond direction $[eqn]$ . All these feature representations are obtained through random initialization.

[eqn]

[eqn]

where $[eqn]$ are learnable parameters; The $[eqn]$ represents the initial embedding of node $[eqn]$ fed into the GNN; and $[eqn]$ represents the embedding of edge $[eqn]$ . For node $[eqn]$ at the $[eqn]$ -th layer ( $[eqn]$ ), the output embedding $[eqn]$ of the GIN can be expressed by Equation (10):

[eqn]

here $[eqn]$ denotes the set of neighbors of node $[eqn]$ . Here, the aggregation function $[eqn]$ is the sum function. To further extract the graph-level features $[eqn]$ , a $[eqn]$ operation is performed on all node features, as shown in Equation (11):

[eqn]

$[eqn]$ uses global average pooling. Finally, $[eqn]$ is fed into an MLP layer to better represent the holistic graph representation of the molecular graph, as shown in Equation (12):

[eqn]

where $[eqn]$ is the embedding of the 2D molecular topological graph.

2.4.2 SMILES sequence feature extraction

$[eqn]$ model (Liu et al. 1907) is used for feature extraction from the molecular SMILES sequence $[eqn]$ , as shown in Equation (13):

[eqn]

Here, $[eqn]$ is the tokenizer corresponding to $[eqn]$ , and $[eqn]$ is the encoded SMILES sequence.

[eqn]

Finally, $[eqn]$ is fed into an MLP layer to learn a more expressive embedding for the SMILES sequence, as shown in Equation (15):

[eqn]

where $[eqn]$ is the embedding of the SMILES sequence.

2.4.3 Consistency information extraction and alignment

Since multi-view data represent the same sample from different perspectives, they inherently share common (consistency) information, while also containing unique (specific) information. To disentangle these, covariance is used to separate the representations into consistency and view-specific information, thereby minimizing their mutual correlation. Finally, the consistency information across views is aligned to enhance representation learning.

Extracting consistency information and specific information. Covariance loss quantifies the correlation between two vectors. A positive value denotes a positive correlation between the two vectors, while a negative value indicates a negative correlation. Therefore, to fully decouple the consistency information from the specific information, we minimize their covariance loss, aiming for a value of zero which implies no linear relationship. Thus, the covariance loss between vectors $[eqn]$ and $[eqn]$ is given in Equation (16):

[eqn]

The inter-view consistency information, denoted as $[eqn]$ and $[eqn]$ , is taken from the first halves of the SMILES sequence embedding $[eqn]$ and the molecular graph embedding $[eqn]$ , respectively. While the specific information, denoted as $[eqn]$ and $[eqn]$ , is taken from their latter halves. Therefore, we have the following Equations (17) and (18):

[eqn]

[eqn]

where $[eqn]$ and $[eqn]$ share the same dimensions, and similarly, so do $[eqn]$ and $[eqn]$ . Therefore, the covariance loss for each view is defined by Equations (19) and (20) as follows:

[eqn]

[eqn]

Finally, the total covariance loss $[eqn]$ is obtained:

[eqn]

Algorithm 1:Pretraining Procedure of MMPCS1: Input: Dataset D of paired molecule SMILES $[eqn]$ and 2D molecular graphs $[eqn]$ .2: Initial: molecular graph encoder (GIN+MLP), molecular SMILES encoder (RoBERTa + MLP), Consistency Projectors $[eqn]$ , $[eqn]$ , Molecule Graph and SMILES sequence feature reconstruction autoencoder $[eqn]$ , $[eqn]$ .3: for *epoch = 1 to Pretrain_Epoch do 4: for each batch $[eqn]$ from $[eqn]$ do 5: / Obtain 2D graph embeddings $[eqn]$ and SMILES embeddings $[eqn]$ from views $[eqn]$ and $[eqn]$ /6: Obtain the 2D graph embeddings $[eqn]$ [Equations (8)–(12)]7: Obtain the SMILES embeddings $[eqn]$ [Equations (13)–(15)]8: / Split embeddings into consistency and specific parts /9: Obtain the consistency information $[eqn]$ and specific information $[eqn]$ of SMILES view [Equation (17)]10: Obtain the consistency information $[eqn]$ and specific information $[eqn]$ of 2D graph view [Equation (18)]11: Calculate Covariance Loss $[eqn]$ [Equations (19)–(21)]12: / Align the consistency information /13: Predict graph consistency information $[eqn]$ from $[eqn]$ [Equations (22) and (23)]14: Predict SMILES consistency information $[eqn]$ from $[eqn]$ [Equations (24) and (25)]15: Calculate consistency information alignment losses $[eqn]$ and $[eqn]$ [Equations (27) and (28)]16: / View feature reconstruction /17: Reconstruct the 2D graph embeddings based on $[eqn]$ and $[eqn]$ [Equations (29) and (30)]18: Reconstruct the SMILES embeddings based on $[eqn]$ and $[eqn]$ [Equations (31) and (32)]19: Calculate reconstruction loss $[eqn]$ and $[eqn]$ [Equations (33) and (34)]20: / Calculate the overall alignment loss and the overall pretraining loss */21: Obtain the overall alignment loss $[eqn]$ [Equation (35)]22: Calculate pretraining Loss $[eqn]$ through Equation (36) and update parameters23: end for 24: end for 25: Return Pretraining parameters

Aligning the consistency information. We introduce a consistency alignment mechanism that projects both views into a common embedding space using projection networks with identical architectures. This mechanism not only aligns consistency information bidirectionally between the two views but also supports cross-view reconstruction: the consistency information from one view, when combined with the specific information of another view, can reconstruct the complete representation of that other view. This design is motivated by the shared nature of consistency information and the complementary role of view-specific information.

Specifically, we use a SMILES-to-graph projector ( $[eqn]$ ) to predict the consistency information $[eqn]$ of the molecular graph from the consistency information $[eqn]$ of the SMILES, as shown in Equations (22) and (23):

[eqn]

[eqn]

where $[eqn]$ is obtained by encoder $[eqn]$ . Then, $[eqn]$ is decoded by decoder $[eqn]$ to predict the consistency information $[eqn]$ . Both the encoder and decoder each consist of a three-layer MLP. Symmetrically, we use a graph-to-SMILES projector ( $[eqn]$ ) to predict the SMILES consistency information $[eqn]$ . This process is formulated in Equations (24) and (25):

[eqn]

[eqn]

where $[eqn]$ is obtained by encoder $[eqn]$ . Then, $[eqn]$ is decoded by decoder $[eqn]$ to predict the consistency information $[eqn]$ .

Based on the above alignment rationale, two consistency information alignment losses are defined: $[eqn]$ for predicting the graph view’s consistency information from that of the SMILES view, and $[eqn]$ for predicting the SMILES view’s consistency information from that of the graph view. Let $[eqn]$ denote the ground-truth consistency vector, and $[eqn]$ represents the reconstructed prediction. The Mean Squared Error (MSE) used to calculate the consistency information alignment loss is expressed as Equation (26):

[eqn]

$[eqn]$ can be obtained using Equation (27):

[eqn]

$[eqn]$ is defined by Equation (28):

[eqn]

View feature reconstruction. As established, the shared consistency information enables cross-view reconstruction. Accordingly, the complete representation of each view is reconstructed by combining its specific information with the consistency information from the other view. To achieve this, we implement a Molecule Graph Embedding Projector (MGEP) and a Molecule SMILES Embedding Projector (MSEP), wherein both are designed as three-layer MLPs to mutually reconstruct the respective view representations and enhance model learning. The reconstruction of the molecular graph embedding is formally defined in Equations (29) and (30) as follows:

[eqn]

[eqn]

Similarly, the reconstruction of the SMILES sequence embedding is defined in Equations (31) and (32) as follows:

[eqn]

[eqn]

The reconstruction performance is assessed with two loss functions. The first measures the reconstruction loss for the molecular graph embedding as defined by Equation (33), while the second quantifies the loss for the SMILES sequence embedding reconstruction using Equation (34).

[eqn]

[eqn]

Thus, the total alignment loss $[eqn]$ between the SMILES and 2D molecular topological graph views is computed as the sum of the four component losses, given by Equation (35):

[eqn]

The overall pretraining loss $[eqn]$ is defined by Equation (36) as the weighted sum of $[eqn]$ and $[eqn]$ , where the coefficients $[eqn]$ and $[eqn]$ balance their respective contributions.

[eqn]

The pseudocode summarizing the MMPCS pretraining process is presented in Algorithm 1.

3 Results

3.1 Empirical setting

To ensure a fair comparison, both MMPCS and the baseline methods were pretrained for the same number of epochs on identical datasets. For the downstream tasks, we adopted a consistent data-splitting strategy, partitioning each dataset into training, validation, and test sets in an 8:1:1 ratio. During fine-tuning, all prediction tasks were fine-tuned simultaneously, with each task using a dedicated projection head. The inputs for these tasks integrated both view-consistency and view-specific information. The key hyperparameters and their configurations for both pretraining and fine-tuning stages are detailed in Table 4.

3.2 Comparison of MMPCS with baseline methods for property prediction

To evaluate the effectiveness of our pretraining model, it was benchmarked against 16 classic and state-of-the-art methods which are categorized into three groups:

Sequence-based pretraining methods: SMILES-Transformer (Honda et al. 2019).Graph-based pretraining methods: GraphCL (You et al. 2020), JOAO (You et al. 2021), JOAOv2 (You et al. 2021), GraphLoG (Xu et al. 2021), Lp-Info (You et al. 2022), SimGRACE (Xia et al. 2022), AD-GCL (Suresh et al. 2021), MGSSL (Zhang et al. 2021), GraphMAE (Hou et al. 2022), MolCLR (Wang et al. 2022), Mole-BERT (Xia et al. 2023b).Multi-view pretraining methods: GraphMVP (Liu et al. 2022), 3DInfomax (Stärk et al. 2022), MEMO (Zhu et al. 2022), and FragCL (Kim et al. 2023).

The performance comparison of various methods on molecular attribute classification tasks is summarized in Table 5. MMPCS achieved an average ROC-AUC of 86.35% across seven benchmark datasets, surpassing existing methods such as MolCLR and MGSSL. This notable improvement highlights the efficacy of our pre-training framework. Furthermore, MMPCS exhibited strong performance on toxicity- and side effect-related datasets, including Tox21, ToxCast, SIDER, and ClinTox. On ToxCast, it attained a ROC-AUC of 86.81%, representing a 9.7% gain over the second-best method (SMILES-Transformer). Similarly, it achieved 82.55% on SIDER and 98.76% on ClinTox, indicating substantial improvements over baseline models. These gains can be attributed to the MMPCS’s pre-training framework that simultaneously: (i) suppresses irrelevant perturbations, (ii) compresses redundant features, and (iii) retains critical structural information. This capability is particularly important for toxicity prediction tasks that are highly sensitive to specific substructures such as nitro groups, halogens, and conjugated systems (Kazius et al. 2005, Hao et al. 2019). The comparatively weaker performance of perturbation-driven contrastive graph pretraining methods (such as GraphCL, JOAO, and JOAOv2) further supports our hypothesis that excessive structural perturbations may hinder the preservation of chemically critical substructures during molecular representation learning. On property prediction tasks across the BBBP, Estrogen, and MetStab datasets, MMPCS delivered strong overall results. It achieved top performance on the BBBP and Estrogen datasets and ranked second on MetStab. Notably, despite MEMO leveraging four molecular views, MMPCS surpassed it with only two views. The experimental outcomes collectively demonstrate that MMPCS effectively integrates multi-modal molecular information to produce comprehensive and discriminative molecular representations.

To further evaluate the generalizability of MMPCS, it was compared with baseline models on five downstream regression tasks. As summarized in Table 6, MMPCS achieved the best or second-best results on the four out of the five datasets and maintained the highest average performance, with particularly notable improvement on the ESOL dataset. These results demonstrate the competitiveness and robustness of MMPCS in regression settings. It is worth noting that multi-view baselines failed to achieve comparable performance, implying that the effective capture of view-specific information is crucial for regression tasks. Moreover, MMPCS outperformed all two-view models and surpassed even the four-view model MEMO by a significant margin.

3.3 MMPCS for other drug-related tasks

To evaluate the effectiveness of molecular representations in drug-related tasks, we leveraged the pre-trained MMPCS model to generate representations for the CDR and DTA prediction tasks. For CDR prediction, the baseline methods included DeepTTA (Jiang et al. 2022), DeepCDR (Liu et al. 2020), tCNNs (Liu et al. 2019), CDRscan (Chang et al. 2018), and MOLI (Sharifi-Noghabi et al. 2019). For DTA prediction, the compared methods were GraphDTA(Nguyen et al. 2021) WideDTA (Öztürk et al. 2019), DeepDTA (Öztürk et al. 2018), SimBoost (He et al. 2017), and KronRLS (Cichonska et al. 2018). In these experiments, the original drug-encoding modules of the baseline models were substituted with representations generated by MMPCS. Specifically, the drug-processing component of DeepTTA was substituted in the CDR prediction task, and similarly, that of GraphDTA was replaced in the DTA prediction task. To maintain architectural consistency, the MMPCS representations were adapted to the required hidden dimensions via a lightweight linear projection layer, while the protein-encoding components of both DeepTTA and GraphDTA remained unchanged in structure and parameters.

As summarized in Tables 7 and 8, MMPCS achieved the best performance across all evaluation metrics, surpassing all baseline methods. These results demonstrate that the molecular representations learned by MMPCS possess strong generalization capabilities and can be effectively transferred to improve performance in key drug discovery tasks.

To further validate its practical utility, MMPCS was applied to an important real-world task, antiviral drug repurposing for the SARS-CoV-2 Omicron variant. We selected four major variants (BA.2.75, BA.4/BA.5, BA.4.6, BA.1) and screened 1615 FDA-approved drugs from the ZINC database as potential inhibitors. GraphDTA, fine-tuned on KIBA dataset, incorporated MMPCS for drug processing to predict drug-RBD affinity. Based on their predicted binding affinity against four Omicron variants, the ten top-ranked candidates were selected for further analysis. As illustrated in Fig. 2, several compounds with potential antiviral activity were identified, meriting further validation. Among them, three drugs highlighted in red have been documented to inhibit SARS-CoV-2. Specifically, Abemaciclib (ZINC000072318121) effectively inhibits SARS-CoV-2 replication in vitro, reduces cytopathic effects, and demonstrates favorable antiviral efficacy and safety in animal models(Wang et al. 2021). Copanlisib (ZINC000068247389) mimics the binding mode of VIR251, showing high affinity and potential as an inhibitor. It significantly inhibits Omicron pseudo viruses by reducing viral entry via endocytosis, highlighting its promise for infection prevention (Kuzikov et al. 2022). Brigatinib (ZINC000148723177), a multi-target tyrosine kinase inhibitor, potently suppresses SARS-CoV-2 spread. In vivo studies have demonstrated that oral administration of Brigatinib can reduce viral load and alleviate lung inflammation in mice, though further clinical trials are still required to confirm its safety and efficacy in humans (Shin et al. 2024). Collectively, these experimental findings further corroborate the effectiveness of the MMPCS-based drug screening approach in identifying potential antiviral candidate compounds.

Sankey diagram visualizes the predicted drug–target affinities between four SARS-CoV-2 Omicron variant proteins (top) and the top 10 FDA-approved drugs (bottom). The width of each ribbon represents the predicted affinity strength, where a thicker ribbon denotes a stronger affinity.

3.4 Ablation study

In this section, we conduct ablation studies on seven representative datasets from MoleculeNet to evaluate the contribution of each component in the MMPCS framework. The selected datasets cover diverse classification tasks (BBBP, SIDER, ClinTox and Estrogen) and regression tasks (ESOL and FreeSolv), encompassing a wide range of molecular properties and task characteristics. In the following subsections, we systematically evaluated the impact of network architecture design and decorrelation loss functions to validate the effectiveness and rationality of each MMPCS component.

3.4.1 The influence of model components

In this section, we conducted ablation studies from three perspectives: loss function design, view-specific information utilization and pretraining usage. First, to investigate the necessity of combining both loss functions, we constructed two model variants: one using only the consistency alignment loss ( $[eqn]$ ) and the other using only the covariance loss ( $[eqn]$ ). Second, to evaluate the contribution of view-specific information, we fine-tuned the model using solely the SMILES sequence ( $[eqn]$ ) or the molecular graph ( $[eqn]$ ). Fine-tuning with a single view is equivalent to discarding the specific information from the other view during inference, which may limit the representational capacity of the model. Finally, to assess the importance of pretraining, we evaluated a non-pretrained variant ( $[eqn]$ ).

As shown in Table 9, the ablation studies revealed several key findings. First, MMPCS consistently outperformed $[eqn]$ and $[eqn]$ across most tasks, especially on BBBP, ESOL, and FreeSolv. This demonstrates that jointly optimizing both losses functions is essential for effectively balancing shared and view-specific representation learning. Second, fine-tuning with both views outperformed the single-view variants, highlighting the complementary nature of SMILES and graph representations. On SIDER, where inter-view consistency information plays a dominant role, the performance gap between single-view and dual-view models is small. In contrast, for tasks such as SIDER and ClinTox, which are more sensitive to local molecule structures, the graph provides critical view-specific information. For ESOL, SMILES contributes more to prediction, while on FreeSolv, the graph view is more informative. These observations collectively confirm the task-dependent complementarity of the two molecular modalities. Third, the non-pretrained variant $[eqn]$ exhibits noticeable performance degradation across all datasets. For example, on BBBP, MMPCS achieves a ROC-AUC of 73.41, while $[eqn]$ drops to 70.02. A similar trend is observed on ESOL, where the RMSE of $[eqn]$ increases by 0.072 compared to the fully pretrained model. These results indicate that pretraining process enables the model to learn transferable molecular knowledge that benefits downstream tasks.

3.4.2 The influence of decorrelation loss functions

To systematically assess the efficacy of the covariance loss in MMPCS for representation decoupling, we benchmarked MMPCS against three established decorrelation approaches: Orthogonality Loss (OL), Hilbert-Schmidt Independence Criterion (HSIC), and Mutual Information Neural Estimation (MINE). Across all comparisons, only the covariance loss was substituted, while all other settings remained unchanged.

As shown in Table 10, the covariance loss achieved either superior or highly competitive performance across all five benchmark datasets, achieving the highest average performance among all evaluated methods. This validates its effective balance between decorrelation effectiveness and training stability, establishing it as a practical and reliable choice for multi-view molecular representation learning in MMPCS.

3.5 The impact of hyperparameters

To further investigate the influence of hyperparameters on model performance, we conducted a sensitivity analysis on the weighting coefficients of the two primary loss components: the covariance loss (controlled by the $[eqn]$ ) and the consistency alignment loss (controlled by $[eqn]$ ). In this study, $[eqn]$ was fixed at 1, while $[eqn]$ was varied across the set {0.1, 0.2, 0.3, 0.5, 1, 2, 3, 5, 10}. The performance of MMPCS under these loss weight configurations was evaluated on five representative datasets.

The performance trends under varying $[eqn]$ (with α fixed at 1.0) are shown in Fig. 3. A consistent optimum is observed at $[eqn]$ =1.0 across all tasks, with regression achieving its lowest RMSE and classification tasks their highest scores. This indicates that equally weighting the consistency and covariance losses (i.e. $[eqn]$ =1.0) creates an optimal balance, allowing the model to leverage both objectives effectively and thus achieve the best overall performance.

Performance trends of MMPCS under varying loss weight β (with α fixed at 1.0) for (a) classification and (b) regression tasks.

3.6 Sensitivity analysis of GIN layers

To investigate the effect of GIN network depth on model performance, we evaluated architectures with 2, 3, and 4 layers across several representative datasets, including BBBP, SIDER, and Estrogen for classification, and ESOL and FreeSolv for regression. As shown in Fig. 4, the model with three GIN layers achieves the best or near-best performance across most datasets. In classification tasks, the three-layer GIN attains a ROC-AUC of 73.41% on BBBP, substantially outperforming both the two- and four-layer variants, while also maintains stable superiority on SIDER and Estrogen. In regression tasks, it achieves the lowest RMSE on both ESOL and FreeSolv, demonstrating its stronger numerical approximation capability. Overall, reducing the depth to two layers limits the model’s expressive power, whereas increasing it to four layers results in performance degradation, likely due to over-smoothing and training instability. These findings collectively indicate that the three-layer GIN provides an optimal balance between representational capacity and model stability.

Effect of GIN network depth on model performance across (a) classification and (b) regression datasets.

3.7 Visualization

For the downstream tasks, MMPCS utilizes inter-view consistency and view-specific information as inputs. Unlike a simple concatenating of the two view embeddings, MMPCS reduces the feature dimensionality by 25%. This efficiency gain is achieved by avoiding the duplication of inter-view consistency information. To demonstrate the effectiveness of this approach, t-SNE visualizations were generated for all molecules in the Estrogen dataset using three distinct representations: (1) $[eqn]$ : SMILES embedding concatenated with the view-specific information of the molecular graph; (2) $[eqn]$ : the view-specific information of the SMILES concatenated with the embedding of the molecular graph, and (3) $[eqn]$ : direct concatenation of the SMILES and molecular graph embeddings. The Estrogen dataset supports two binary-classification tasks, $[eqn]$ and $[eqn]$ , which are displayed in the first and second rows of Fig. 5, respectively, with the legend indicating the ground-truth labels for each task.

t-SNE visualization of molecular representations fine-tuned on the Estrogen dataset, which contains two binary classification tasks (α and β). The first row (a–c) corresponds to the α task, and the second row (d–f) to the β task. (a) and (d): SMILES embedding concatenated with the view-specific information of the molecular graph (Se∥geu); (b) and (e): view-specific information of the SMILES sequence concatenated with the embedding of the molecular graph (Seu∥ge); (c) and (f): direct concatenation of the SMILES and molecular graph embeddings (Se∥ge).

The t-SNE visualizations reveal two key findings. First, the strong similarity between $[eqn]$ (Fig. 5a) and $[eqn]$ (Fig. 5b) indicates that MMPCS effectively extracts consistent inter-view information, regardless of which view provides the base embedding. Similar trends are observed in Fig. 5d and e. Additionally, $[eqn]$ (Fig. 5a) closely matches $[eqn]$ (Fig. 5c), confirming our method preserves information while reducing input dimensionality, as further supported by Fig. 5d and f.

The nearly identical DBI values across the three different connection methods, specifically 0.72, 0.74, and 0.74 for task α in Fig. 5(a)–(c), and 0.89, 0.91, and 0.91 for task β in Fig. 5(d)–(f), confirm their comparable clustering performance, thereby validating the reliability of our method. This is visually supported by the high similarity between the first two columns in Fig. 5, indicating that SMILES and molecular graph embeddings encode nearly identical consistency information. Furthermore, the close resemblance between these and the third column demonstrates that MMPCS achieves significant dimensionality re duction without losing critical information.

4 Conclusion

In this study, we proposed MMPCS, a multi-view pretraining framework that effectively captures both inter-view consistency and view-specific information without relying on data augmentation techniques. This design avoids the risk of excessively altering molecular properties through data augmentation, which could otherwise lead to biased representations. Furthermore, by eschewing chemistry-specific augmentation strategies and rule-based substructure design, MMPCS minimizes its dependence on manually crafted domain knowledge, while still leveraging general molecular priors learned through data-driven pretraining. Extensive experiments demonstrated that MMPCS achieves superior performance, outperforming existing methods on average across eight classification datasets and five regression tasks. It also achieves state-of-the-art performance in DTA and CDR prediction, and demonstrates promising potential for COVID-19 drug repositioning. Ablation studies confirm the necessity of combining both loss functions and highlights the importance of incorporating view-specific information. Visualization results further validate the effectiveness of our approach in generating compact yet informative representations for downstream tasks. Looking forward, we plan to extend MMPCS by incorporating additional molecular views during pretraining. A current limitation is the requirement for consistency and view-specific information to share identical dimensions within the same view. Future work will focus on developing more flexible mechanisms to acquire this information adaptively.

Bibliography70

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Ahmad W, Simon E, Chithrananda S et al Chemberta-2: Towards chemical foundation models. ar Xiv, ar Xiv: 2209.01712, 2022, preprint: not peer reviewed.
2Cereto-Massagué A , Ojeda MJ, Valls C et al Molecular fingerprint similarity search in virtual screening. Methods 2015;71:58–63.25132639 10.1016/j.ymeth.2014.08.005 · doi ↗ · pubmed ↗
3Chang Y , Park H, Yang H-J et al Cancer drug response profile scan (CD Rscan): a deep learning model that predicts drug effectiveness from cancer genomic signature. Sci Rep 2018;8:8857.29891981 10.1038/s 41598-018-27214-6PMC 5996063 · doi ↗ · pubmed ↗
4Chapman PB , Hauschild A, Robert C et al BRIM-3 Study Group. Improved survival with vemurafenib in melanoma with BRAF V 600E mutation. N Engl J Med 2011;364:2507–16.21639808 10.1056/NEJ Moa 1103782 PMC 3549296 · doi ↗ · pubmed ↗
5Chen Y , Wen Y, Xie C et al MOCSS: multi-omics data clustering and cancer subtyping via shared and specific representation learning. i Science 2023;26:107378.37559907 10.1016/j.isci.2023.107378 PMC 10407241 · doi ↗ · pubmed ↗
6Cichonska A , Pahikkala T, Szedmak S et al Learning with multiple pairwise kernels for drug bioactivity prediction. Bioinformatics 2018;34:i 509–18.29949975 10.1093/bioinformatics/bty 277PMC 6022556 · doi ↗ · pubmed ↗
7Delaney JS. ESOL: estimating aqueous solubility directly from molecular structure. J Chem Inf Comput Sci 2004;44:1000–5.15154768 10.1021/ci 034243 x · doi ↗ · pubmed ↗
8Du Y, Fu T, Sun J et al Molgensurvey: A systematic survey in machine learning models for molecule design. ar Xiv, ar Xiv: 2203.14500, 2022, preprint: not peer reviewed.