Improving Identification of Drug-Target Binding Sites Based on Structures of Targets Using Residual Graph Transformer Network

Shuang-Qing Lv; Xin Zeng; Guang-Peng Su; Wen-Feng Du; Yi Li; Meng-Liang Wen

PMC · DOI:10.3390/biom15020221·February 3, 2025

Improving Identification of Drug-Target Binding Sites Based on Structures of Targets Using Residual Graph Transformer Network

Shuang-Qing Lv, Xin Zeng, Guang-Peng Su, Wen-Feng Du, Yi Li, Meng-Liang Wen

PDF

Open Access

TL;DR

This paper introduces RGTsite, a new deep learning model that improves the identification of drug-target binding sites using a residual graph transformer network.

Contribution

The novel Residual Graph Transformer Network (RGTsite) improves drug-target binding site prediction by fusing sequence and structural features with better performance than existing methods.

Findings

01

RGTsite outperformed state-of-the-art methods in F1-score and MCC metrics on multiple benchmark datasets.

02

Interpretability analysis confirmed RGTsite's effectiveness in identifying drug-target binding sites in real-world cases.

Abstract

Improving identification of drug-target binding sites can significantly aid in drug screening and design, thereby accelerating the drug development process. However, due to challenges such as insufficient fusion of multimodal information from targets and imbalanced datasets, enhancing the performance of drug-target binding sites prediction models remains exceptionally difficult. Leveraging structures of targets, we proposed a novel deep learning framework, RGTsite, which employed a Residual Graph Transformer Network to improve the identification of drug-target binding sites. First, a residual 1D convolutional neural network (1D-CNN) and the pre-trained model ProtT5 were employed to extract the local and global sequence features from the target, respectively. These features were then combined with the physicochemical properties of amino acid residues to serve as the vertex features in…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Genes1

CLEC3B

Proteins1

Species1

Homo sapiens(human · species)

Cell lines3

PATP-NW30-103— Rattus norvegicus (Rat) · Spontaneously immortalized cell line 1OJL-E— Mus musculus (Mouse) · Malignant neoplasms of the mouse mammary gland · Cancer cell line 2Z08-A.— Homo sapiens (Human) · Melanoma · Cancer cell line

Chemicals4

amino acid ATP PATP-GTN

Diseases2

TN injury to people or property

Figures3

Click any figure to enlarge with its caption.

Funding5

—National Natural Sciences Foundation of China
—Yunnan Young and Middle-aged Academic and Technical Leaders Reserve Talent Project in China
—Talent introduction and scientific research start-up fund project WYUAS
—Research Project on Undergraduate Education and Teaching Reform in Yunnan Province in 2024
—State Key Laboratory for Conservation and Utilization of Bio-Resources in Yunnan, Yunnan University

Keywords

drug-target binding sitesstructureresidual graph transformer networkresidual 1D-CNN

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsComputational Drug Discovery Methods · Machine Learning in Bioinformatics · Protein Structure and Dynamics

Full text

1. Introduction

Identification of drug-target binding sites (also known as pockets) is of great significance in drug screening and design [1,2]. For example, ATP (Adenosine triphosphate) binding sites play a crucial role in cellular-energy metabolism, signal transduction, and protein function. Over the years, the computational methods for identifying drug-target binding sites have evolved and can be broadly divided into two categories: traditional methods [3] and machine learning approaches [4,5,6,7]. Traditional methods include an energy-based approach Q-SiteFinder [8], which uses the interaction energy between the protein and a simple van der Waals probe to identify energetically favorable binding sites, a docking-based approach [9], which utilizes docking tools to locate the binding sites, and Fpocket, which is based on Voronoi tessellation and alpha spheres [10,11]. While some of these methods, such as Fpocket, have demonstrated strong performance in the identification of binding sites and are widely used, traditional methods are increasingly showing limitations, including insufficient information fusion and slow computational speed as the number of target structures grows. In contrast, machine learning methods have shown superior performance in identifying binding sites and faster computational efficiency, especially when handling large-scale and complex datasets [12,13,14]. As a result, the continued development of machine learning methods for drug-target binding sites identification is expected to accelerate the drug discovery process [15]. For instance, prediction models of ATP binding sites can not only improve the efficiency of drug screening, but also assist in identifying potential drug targets, designing more accurate targeted drugs, and supporting new drug development.

In modern drug development, computation-based drug screening and wet-lab experiments for drug activity verification are two critical steps, with drug screening playing a critical role in narrowing the scope of wet-lab experiments. Computation-based drug screening process can be divided into three main components: drug-target interaction prediction, drug-target binding affinity prediction, and drug-target binding sites prediction. Research in these areas heavily relies on machine learning techniques. Drug-target interaction prediction models help screen small molecules with activity against specific targets from large molecular databases [16,17,18]. Once potential active small molecules are identified, the next step is to filter out those with weak binding affinities. As a result, machine learning models for predicting drug-target binding affinity have garnered significant attention, including models like MDF-DTA [19], MMDTA [20], and MvGraphDTA [21]. Although drug-target interaction and affinity prediction models can substantially reduce the pool of potential candidates, accurately identifying the specific binding sites of small molecules remains a significant challenge. This is due to the presence of multiple binding pockets on the surface of a target. Consequently, drug-target binding sites prediction models become particularly important. These models can be classified into two categories: classic machine learning methods and deep learning approaches. Classic machine learning methods predict drug-target binding sites by incorporating the shallow features of drugs and targets, such as molecular fingerprints, physicochemical properties of amino acid residues, sequence encodings, etc., into models like random forest [22], support vector machine [23], and XGBoost [24,25]. While these approaches have contributed valuable insights into binding sites identification, their performance is reaching a bottleneck due to their reliance on shallow feature representations of both drugs and targets. In contrast, deep learning methods can leverage not only shallow features but also extract deep-level features that affect binding sites identification. For example, methods [26,27,28] employed 3D convolutional neural networks to extract structural features from the 3D structures of targets. Additionally, models [29,30,31,32] represented the 3D structures of targets as graphs, using graph neural networks to extract structural features and classifying vertices in the graph. Moreover, some methods enhance their prediction capabilities by incorporating deep neural networks and self-attention mechanisms to capture deep-level features of targets [33,34,35]. Despite the advancements made by current drug-target binding sites prediction models, particularly through the integration of diverse data modalities such as physicochemical properties of amino acid residues, target sequences, and target 3D structures, there remains considerable room for performance improvement. Challenges such as insufficient deep-level feature extraction, dataset imbalances, and incomplete model development continue to hinder the full potential of these models in drug-target binding sites prediction.

To further improve the performance of drug-target binding sites identification, we proposed a novel deep learning framework called RGTsite. This framework combined a residual 1D convolutional neural network (1D-CNN) with a residual graph transformer network (GTN) to extract deep-level features from both the sequences and structures of targets. Ultimately, drug-target binding sites were predicted through a fully connected network. Overall, RGTsite made three key contributions as follows.

(1)RGTsite employed a residual 1D-CNN along with the pre-trained model ProtT5 to extract the local and global sequence features from the target. These sequence features were then combined with the physicochemical properties of amino acid residues to construct comprehensive feature representations for each vertex in a target chain graph. Additionally, RGTsite incorporated six-dimensional edge features. These enhancements further strengthened the representational capacity of graph.
(2)Building upon the standard GTN, we introduced a residual technique to aggregate edge features into the vertex features, forming a novel residual GTN. This innovation allowed for a more thorough capture of structural information.
(3)On multiple benchmark datasets, RGTsite exhibited the superior performance compared to the state-of-the-art methods, despite class imbalance, in terms of Matthews Correlation Coefficient (MCC) and F1-score (F1). Furthermore, we conducted interpretability analysis for RGTsite through the specific cases, which further validated the efficiency and effectiveness of RGTsite as a tool for drug-target binding sites identification.

2. Materials and Methods

2.1. Proposed Model

In this study, we proposed a drug-target binding sites identification model, RGTsite (Figure 1), which was based on a GTN architecture. The input to RGTsite consisted of three components: target chain graph, vertex features, and edge features. The construction of the target chain graph and edge features are detailed in Section 2.3, while the vertex features were composed of both sequence features and physicochemical features of amino acid residues: (1) A label encoding representation of the target sequence, which was first mapped into a 1024-dimensional feature matrix through a predefined vocabulary and embedding dimension (embedding layer). This feature matrix was then passed through a 3-layer residual 1D-CNN, with padding sizes of 1, 2, and 3 for each layer and convolution kernels of 3 × 3, 5 × 5, and 7 × 7 for the respective layers. This process resulted in a 1024-dimensional representation of local sequence features. (2) After inputting the target sequence into the pre-trained model ProtT5, a 1024-dimensional global sequence feature matrix was obtained for each amino acid residue. (3) Finally, by concatenating the 36-dimensional physicochemical features of amino acid residue with the obtained sequence features, a 2084-dimensional feature matrix was generated and used as the vertex features for the target chain graph.

The target chain graph, vertex features, and edge features were fed into the residual GTN. First, a multi-head attention mechanism (with 8 heads) was employed to extract the vertex and edge features, which were then linearly transformed through a fully connected layer. Using the residual connection method, the original vertex features and edge features were added element-wise to the transformed features, updating both the vertex features and edge features. Next, the updated features were passed into a graph convolutional network with a single hidden layer (256 neurons) to extract the deeper vertex features. Simultaneously, the updated vertex features were input into a fully connected network for linear transformation and then added to the vertex features output by the graph convolutional network. Finally, after undergoing normalization, dropout, feedforward neural network processing, and residual operation, the network output a 256-dimensional vertex feature matrix. This matrix was then fed into a three-layer fully connected network, with sigmoid activation functions applied between layers. The network architecture comprised 64, 16, and 1 neuron in each layer, respectively, to predict drug-target binding sites.

2.2. Datasets

In this study, we employed three standard datasets, including PATP-429, PATP-1930, and PATP-NW30 (Table 1). These datasets consisted of target sequences bound to ATP, and these sequences were sourced from the Protein Data Bank (PDB), with variations in data collection years and preprocessing methods. As of 5 November 2016, PATP-429 was constructed, containing 429 sequences. Redundant sequences with greater than 40% sequence consistency were removed using the CD-HIT tool [36]. By 23 June 2023, the NW-Align tool [37] was used to cluster sequences at 40% sequence identity, resulting in the creation of PATP-1930. Additionally, using NW-Align with a 30% sequence identify threshold, PATP-NW30 was generated. As shown in Table 1, it was noted that PATP-1930 was not divided into an independent test set, instead, it directly used the test set of PATP-429. In this process, amino acid residues that bind to ATP were labeled as positive samples, while other residues in the chain served as negative samples. There were significant differences in the positive-to-negative sample ratios in the training and test sets across PATP-429, PATP-1930, and PATP-NW30, all of which were typical imbalanced datasets.

2.3. Feature Representation

We employed two ways to represent the sequence features of a target chain. (1) Label encoding: Each target chain generally contained 20 common amino acids, each represented by a number from 1 to 20. Any uncommon amino acids were grouped together and represented by the number 21. (2) Pre-trained model for feature extraction: Given the large number of target sequences, relying solely on extracting deep-level features from a limited set of sequences would not provide comprehensive representations. To address this issue, we leveraged the widely used pre-trained model ProtT5-XL-U50 (ProtT5) [38], which was pre-trained and fine-tuned on BFD (Big Fantastic Database) [39] and UniRef50 [40], covering target sequences of multiple biological species. ProtT5 was applied to extract a 1024-dimensional feature vector for each amino acid residue in a target chain.

First, we utilized AlphaFold2 [41] to predict the 3D structures of targets. There are two key reasons for choosing AlphaFold2: (1) The 3D structures predicted by AlphaFold2 are particularly useful for identifying binding sites on an ATP-target with unknown structures, and it has strong universality. (2) AlphaFold2, developed by DeepMind, can predict 98.5% of human protein structures. In collaboration with EMBL-EBI, DeepMind has constructed the AlphaFold DB database, which contains a large amount of high-precision protein structure prediction data. While there may be cases where the predicted structure of certain proteins (or parts of them) is inaccurate or “degraded”, such occurrences are rare. Next, we treated the $[eqn]$ atoms of amino acid residues in a target chain as vertices of a graph. The Euclidean distance between every pair of vertices was then calculated. If the distance was less than or equal to 8 Å, an edge was considered to exist between them, and the corresponding entry in the adjacency matrix representing the graph was set to 1. Otherwise, the entry was set to 0.

Physicochemical features of amino acid residues were represented by 16 categories, yielding a total of 36 dimensions (Table 2).

Edge features between amino acid residues were characterized by 6 dimensions (Table 2).

2.4. Residual Graph Transformer Network

In the target chain graph $[eqn]$ , the features of vertex $[eqn]$ are represented by $[eqn]$ ( $[eqn]$ is the vertex feature dimension), and the edge features between vertices $[eqn]$ and $[eqn]$ are represented by $[eqn]$ ( $[eqn]$ is the edge feature dimension). After linear projection, the implicit representations of the vertex and edge features are $[eqn]$ (Formula (1)) and $[eqn]$ (Formula (2)), respectively.

[eqn]

[eqn]

Among them, $[eqn]$ and $[eqn]$ are learnable parameters, while $[eqn]$ are biases ( $[eqn]$ is the dimension of the feature matrix). To incorporate positional encoding into the vertex features, we use the Laplacian vector as the positional encoding in the graph transformer network. The Laplacian feature vector ∆ of each graph is pre-calculated through the factorization of the graph’s Laplacian matrix (Formulas (3)–(5)).

[eqn]

[eqn]

[eqn]

In Formulas (3) and (4), $[eqn]$ minimum non-trivial feature vectors of $[eqn]$ are used to form its corresponding positional encoding representation $[eqn]$ ( $[eqn]$ ). Here, $[eqn]$ is a learnable parameter, $[eqn]$ represents the bias, $[eqn]$ and $[eqn]$ denote the vertex feature matrix and the positional encoding feature matrix, respectively. In Formula (5), $[eqn]$ is the identity matrix, $[eqn]$ is the adjacency matrix of graph, $[eqn]$ is the degree matrix of graph, and $[eqn]$ and $[eqn]$ represent the eigenvalues and eigenvectors, respectively. Graph transformer updates the vertex and edge features primarily through the multi-head attention mechanism. The update process for vertex and edge features at the $[eqn]$ -th layer is shown in Formulas (6)–(8).

[eqn]

[eqn]

[eqn]

$[eqn]$ , $[eqn]$ , $[eqn]$ , $[eqn]$ , $[eqn]$ , and $[eqn]$ are learnable parameters, where $[eqn]$ represents the number of heads and $[eqn]$ indicates the dimension of each head. $[eqn]$ signifies the self-attention weight matrix incorporating the edge features. The resulting edge feature matrix is used as the edge weight matrix $[eqn]$ , from which the adjacency matrix $[eqn]$ of the Laplacian edge weighted vertex $[eqn]$ is derived (Formula (9)).

[eqn]

$[eqn]$ is the diagonal matrix of $[eqn]$ , where $[eqn]$ and $[eqn]$ represent the adjacency matrix and identity matrix, respectively. $[eqn]$ denotes the dimension of the feature matrix, $[eqn]$ represents the edge weight, and $[eqn]$ indicates the target. By using $[eqn]$ and the residual vertex feature $[eqn]$ from the $[eqn]$ -th layer, the residual vertex feature $[eqn]$ of the $[eqn]$ -th layer is obtained (Formula (10)).

[eqn]

$[eqn]$ represents the learnable parameter matrix. $[eqn]$ is fed into the feedforward neural network, where the vertex features $[eqn]$ are obtained through normalization and activation function operations (Formula (11)).

[eqn]

$[eqn]$ and $[eqn]$ are the learnable parameter matrices.

2.5. Model Training

In this study, to prevent overfitting, we adopted an early-stopping mechanism during RGTsite training. The early-stopping strategy was as follows: 10% of target chains from the training set were randomly selected as the validation set. If the validation loss did not decrease for 10 consecutive epochs during training, the process was terminated. The optimizer, activation function, and asymmetric loss function used in RGTsite were Adam, ReLU, and Asymmetric Loss (ASL) [42] (Formula (12)), respectively. The learning rate, weight decay, numerical stability parameters, and momentum coefficient were set to 0.0001, 5 × 10^−5^, 1 × 10^−7^, and (0.9, 0.999), respectively. The batch size and dropout rate were set to 1 and 0.3, respectively.

[eqn]

Among them, $[eqn]$ represents the model’s prediction probability, $[eqn]$ is the focusing parameter, $[eqn]$ is the offset-adjusted probability, and $[eqn]$ and $[eqn]$ denote the positive and negative loss components, respectively.

2.6. Evaluation Metrics

To evaluate the model’s performance, we used sensitivity (Sen), specificity (Spe), accuracy (Acc), precision (Pre), Matthews Correlation Coefficient (MCC), and F1-score (F1).

[eqn]

[eqn]

[eqn]

[eqn]

[eqn]

[eqn]

Among them, $[eqn]$ , $[eqn]$ , $[eqn]$ , and $[eqn]$ represent true positives, false negatives, true negatives, and false positives, respectively. Both F1 and MCC provide a more accurate assessment of the model’s performance on imbalanced datasets.

3. Results

3.1. Performance Comparison Between RGTsite and State-of-the-Art Methods

In this section, we evaluated the performance of RGTsite by comparing it with existing state-of-the-art methods for predicting protein-ATP binding sites, both sequence-based and structure-based.

Sequence-based methods included NsitePred [43], TargetS [44], ATPseq [45], DELIA (seq) [46], DeepATPseq [47], and E2EATP (388, 1930) [12]. Among these, DELIA (seq) referred to DELIA trained only on sequence-based features, and E2EATP (388, 1930) represented the versions of E2EATP trained on PATP-388 and PATP-1930, respectively. We compared the performance of these methods against RGTsite on the independent test set PATP-41. To provide a comprehensive evaluation for RGTsite, we also trained RGTsite (388) and RGTsite (1930) on PATP-388 and PATP-1930, respectively. Experimental results (Table 3) showed that RGTsite (388) outperformed NsitePred, TargetS, ATPseq, DELIA (seq), DeepATPseq, and E2EATP (388) across all evaluation metrics. Notably, RGTsite (388) exhibited improvements in key metrics, achieving F1 (70.16%) and MCC (0.699), surpassing the best-performing methods by 4.14% and 6.72%, respectively, compared to their optimal values (66.02%, 0.655). RGTsite (1930) also outperformed all comparison methods, with the exception of a slight decrease in Spe by 0.18% and Pre by 0.56% compared to ATPseq. In particular, RGTsite (1930) improved F1 (68.21%) and MCC (0.668) by 3.1% and 5.24%, respectively. The comparison results indicated that incorporating 3D structural information of the target enhanced the performance of drug-target binding sites prediction, when compared to sequence-based methods alone.

Structure-based methods included COACH (ITA, AF2) [48], ATPbind (ITA, AF2) [45], and DELIA (ITA, AF2), where ITA and AF2 indicated that the target 3D structures required for these methods were derived from the structure prediction tools I-TASSER [49] and AlphaFold2 [41], respectively. The independent test set used was PATP-41. In the comparative experiment with structure-based methods, all target 3D structures for RGTsite were sourced from the predicted results of AlphaFold2, without using experimentally determined 3D structures. This approach was also designed to provide services for targets lacking experimental 3D structures. Experimental results (Table 4) revealed that RGTsite (388) outperformed existing structure-based state-of-the-art methods across all evaluation metrices, except for Sen. RGTsite (388) excelled in F1 and MCC metrices, showing improvements of 3.1% and 5.27%, respectively, compared to the best-performing method, ATPbind (AF2). Overall, RGTsite’s performance surpassed that of these methods. Specifically, RGTsite (388) achieved an F1 score of 70.16%, and an MCC value of 0.699. RGTsite (1930) showed a 2.67% decrease in Sen compared to COACH (AF2), and a 0.1% in Spe and a 0.05% decrease in Pre, compared to ATPbind (AF2), but outperformed all state-of-the-art methods across other evaluation metrics, particularly in F1 (71.31%) and MCC (0.703), with improvements of 4.25% and 5.87%, respectively, over the best-performing method ATPbind (AF2). Comparisons with state-of-the-art structure-based methods had demonstrated that RGTsite was highly effective in extracting features from vertices (amino acid residues) and in making more accurate predictions regarding binding sites. Additionally, the advantage of applying residue GTN in RGTsite had been confirmed.

To further comprehensively evaluate the performance of RGTsite, we compared it with a series of state-of-the-art methods using the independent test set PATP-NW30-103. Both E2EATP (NW30) and RGTsite (NW30) were trained on PATP-NW30-1861, while the experimental results of other methods were cited from E2EATP. As shown in the results (Table 5), RGTsite (NW30) achieved the highest F1 (60.02%) and MCC (0.595) among all methods, with improvements of 2.63% and 5.12%, respectively, compared to the top-performing method. In addition, RGTsite’s performance across the remaining evaluation metrics also reached the levels comparable to the best values. Overall, RGTsite’s excellent performance on PATP-NW30 demonstrated that the residual GTN-based approach effectively extracted the key features influencing vertex (amino acid residue) classification across different datasets. This highlighted RGTsite’s strong vertex classification capabilities, making it suitable for analysis tasks on complex datasets.

3.2. Impact of Three Types of Features That Make up the Vertex Features on the Model’s Performance

In this study, we integrated the global sequence features output by ProtT5, the local sequence features extracted by a residual 1D-CNN, and the physicochemical features of amino acid residues as the vertex features within the residual GTN. To assess the effectiveness of these three types of features, we combined them and constructed six different variant models (Variant 1–Variant 6). Using PATP-1930 as the training set and PATP-41 as the independent test set, we conducted comparative experiments between RGTsite and six variant models. Experimental results (Table 6) revealed that while RGTsite, which combined all three types of features, achieved comparable or slightly inferior performance compared to the variant models in terms of Sen, Spe, Acc, and Pre, it outperformed all variant models in F1 and MCC. Specifically, RGTsite achieved F1 (71.31%) and MCC (0.703), which were 1.24% and 0.57% higher, respectively, than the corresponding best results obtained by the variant models. These findings clearly showed that combining the three types of features had a positive impact on enhancing the overall performance of the model.

3.3. Residual 1D-CNN Helps Improve the Model’s Performance

Here, residual 1D-CNN was employed to extract the local sequence features. However, there were various models available for extracting the sequence features of target, such as 1D-CNN, recurrent neural networks (RNN), and transformer. Given that transformer was already incorporated in the residual GTN, we aimed to avoid redundant feature extraction and therefore only conducted comparative experiments with RNN and 1D-CNN to validate the effectiveness of residual 1D-CNN. It was worth noting that we adapted an improved three-layer residual 1D-CNN architecture which incorporated two residual connections built upon 1D-CNN. Under the experimental conditions of using PATP-1930 and PATP-41 as the training and test sets, respectively, experimental results (Table 7) showed that RNN outperformed 1D-CNN in terms of Sen, but 1D-CNN showed superior performance across the remaining evaluation metrics. This outcome may be attributed to the redundancy between the global sequence features extracted by RNN and those output by ProtT5, while the local sequence features extracted by 1D-CNN complement the global features from ProtT5 more effectively. Although 1D-CNN demonstrated better performance compared to RNN, the residual 1D-CNN introduced in RGTsite outperformed 1D-CNN in two key evaluation metrics: F1 and MCC. This suggested that the local sequence features extracted by the residual 1D-CNN played a significant role in improving the performance of drug-target binding sites prediction model.

3.4. Residual Graph Transformer Network Is Beneficial for Identifying Binding Sites

Residual GTN served as the core architecture in RGTsite, capturing the vertex feature information by inputting a target chain graph, vertex features, and edge features. To verify the effectiveness of this core architecture, we conducted comparative experiments using widely adopted graph neural network models, including graph convolutional network (GCN), graph attention network (GAT), graph isomorphism network (GIN), GraphSAGE, and GTN, on the training set PATP-1930 and test set PATP-41. Experimental results (Table 8) revealed that GCN and GAT performed relatively poorly, while GIN, GraphSAGE, and GTN all achieved an F1 above 60% and an MCC above 0.6, with the other evaluation metrics being comparable across these models. Notably, GTN outperformed the others in both F1 and MCC, achieving scores of 68.97% and 0.692, respectively. Although GTN exhibited strong performance, it did not incorporate edge features. In contrast, RGTsite, which leveraged residual GTN as its core architecture, introduced the edge features through residual connections, thereby enriching the vertex-feature extraction process. Experimental results confirmed that RGTsite not only performed on par with or better than GTN in four evaluation metrics: Sen, Spe, Acc, and Pre, but also surpassed GTN in F1 and MCC. Specifically, RGTsite showed improvements of 2.34% in F1 and 1.59% in MCC.

3.5. Case Study

To evaluate the performance of RGTsite in practical applications, we selected 20 and 149 target chains from the dataset used by the state-of-the-art method ATP-Deep [50] and constructed two external validation datasets PATP_20 (Table 9) and PATP_149, respectively. All chains in these datasets were excluded from any benchmark sets used in this study. For each chain, PATP_20 contained the following information: target_chain_name, the number of non-binding sites (True_label_0), and the number of binding sites (True_label_1). We processed these chains one by one through RGTsite and recorded the predicted number of non-binding sites (Predicted_label_0), the predicted number of binding sites (Predicted_label_1), the F1, and the MCC. Experimental results (Table 9) showed that RGTsite achieved an F1 of over 80% for 80% of the target chains, and an MCC of over 0.8 for 75% of the target chains. Moreover, the mean F1 and MCC values of all target chains were 86.72% and 0.867, respectively, both of which outperformed RGTsite’s performance on its benchmark test sets. Furthermore, we visualized the binding sites identification by RGTsite on three target chains (Figure 2). By examining the colors of the identified amino acid residues (green: True Negative (TN), red: True Positive (TP), blue: False Positive (FP), and purple: False Negative (FN)) around ATP (yellow), it was intuitively clear that the true binding sites (red) of ATP had been accurately identified. RGTsite’s outstanding performance on PATP_20 demonstrated its reliability as a drug-target binding site prediction tool in practical applications.

We also assessed the practical application capability of RGTsite using PATP_149. Experimental results (Table 10) showed that RGTsite performed well across all evaluation metrics. RGTsite’s outstanding performance on PATP_20 and PATP_149 demonstrated its reliability as a drug-target binding sites prediction tool in practical applications.

Additionally, to further confirm that RGTsite can predict the binding sites of small molecules beyond ATP, we trained RGTsite on the Pro_Train_335 dataset from the advanced method DeepProSite and assessed its performance with the Pro_Test_60 and Pro_Test_315 test sets. Experimental results (Table 11) showed that RGTsite can predict binding sites for small molecules other than ATP, while also exhibiting good performance.

3.6. Interpretability Analysis of RGTsite

To explain the effectiveness of RGTsite, we saved models trained on PATP-1930 at epochs 20, 40, 60, and 83. We then randomly selected two target chains 1OJL_E and 2Z08_A from PATP_20 and ran these models on them. For each model, we counted the number of amino acid residues classified as TN, FN, FP, and TP, as well as the corresponding F1 and MCC values (Table 12). Statistical results showed that with an increasing number of training epochs, RGTsite gradually reduced the misidentification of FP amino acid residues and improved the accuracy of true-binding sites identification. In addition to the statistical data, we visualized the identification results of RGTsite for the target chains after training at epochs 20, 40, 60, and 83 (Figure 3). Among them, gray-white, green, yellow, and red represented amino acid residues classified as TN, TP, FP, and FN, respectively. From the visualization results, it was evident that after 20 epochs of training, RGTsite had mostly identified the binding sites of TP class but still misclassified some non-binding sites as binding sites. As the training epochs increased, the misidentified FP amino acid residues (yellow) gradually decreased, and both the F1 and MCC values improved, ultimately resulting in the accurate identification of binding sites for 1OJL-E and 2Z08-A. Through a comprehensive analysis of both the statistical data and visualization results across different training epochs, it was clear that RGTsite exhibited high effectiveness in predicting drug-target binding sites.

4. Conclusions

RGTsite was a deep learning-based tool for identifying drug-target binding sites, employing a residual 1D-CNN, the pre-trained model ProtT5, and a residual GTN to extract sequence and structural features from targets. Experimental results on multiple benchmark datasets showed that RGTsite outperformed existing state-of-the-art methods in two key evaluation metrics: F1 and MCC. Practical application results and interpretability analysis confirmed that RGTsite was an effective and reliable tool for drug-target binding sites identification. Additionally, RGTsite is also effective for identifying binding sites of small molecules beyond ATP.

In the future, we plan to further improve the performance of drug-target binding sites identification in the following five aspects: (1) After decomposing the target into target chains, all amino acid residues in some chains do not participate in binding sites formation. These chains can be preprocessed to reduce the number of negative samples, thereby alleviating the class imbalance in the dataset; (2) RGTsite is currently a single task model focused solely on predicting drug-target binding sites. We will explore to introduce a drug-target binding affinity prediction task, fusing the deep features extracted from both tasks to create a mutually reinforcing multi-task learning model; (3) While RGTsite has achieved performance improvement in drug-target binding sites identification through the residual GTN architecture, its F1 and MCC are just exceeded 70%, still falling short of user expectations. Continued advancements in deep learning techniques are needed to design innovative drug-target binding sites identification models that can further enhance the prediction performance; (4) We will incorporate the effects of conformational changes on binding site variations to broaden the application scope of RGTsite; (5) Exploring the similarities between different binding pockets to enhance RGTsite’s ability to learn key implicit features.

Bibliography50

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Zhang O. Zhang J. Jin J. Zhang X. Hu R. Shen C. Cao H. Du H. Kang Y. Deng Y. Res Gen Is a Pocket-Aware 3D Molecular Generation Model Based on Parallel Multiscale Modelling Nat. Mach. Intell.202351020103010.1038/s 42256-023-00712-7 · doi ↗
2Vemula D. Jayasurya P. Sushmitha V. Kumar Y.N. Bhandari V. CADD, AI and ML in Drug Discovery: A Comprehensive Review Eur. J. Pharm. Sci.202318110632410.1016/j.ejps.2022.10632436347444 · doi ↗ · pubmed ↗
3Xia Y. Pan X. Shen H.-B. A Comprehensive Survey on Protein-Ligand Binding Site Prediction Curr. Opin. Struct. Biol.20248610279310.1016/j.sbi.2024.10279338447285 · doi ↗ · pubmed ↗
4Jakubec D. Skoda P. Krivak R. Novotny M. Hoksza D. Prank Web 3: Accelerated Ligand-Binding Site Predictions for Experimental and Modelled Protein Structures Nucleic Acids Res.202250 W 593W 59710.1093/nar/gkac 38935609995 PMC 10353840 · doi ↗ · pubmed ↗
5Mc Greig J.E. Uri H. Antczak M. Sternberg M.J.E. Michaelis M. Wass M.N. 3D Ligand Site: Structure-Based Prediction of Protein–Ligand Binding Sites Nucleic Acids Res.202250 W 13W 2010.1093/nar/gkac 25035412635 PMC 9252821 · doi ↗ · pubmed ↗
6Nazem F. Ghasemi F. Fassihi A. Rasti R. Dehnavi A.M. A GU-Net-Based Architecture Predicting Ligand–Protein-Binding Atoms J. Med. Signals Sens.20231311010.4103/jmss.jmss_142_2137292445 PMC 10246592 · doi ↗ · pubmed ↗
7Piazza I. Beaton N. Bruderer R. Knobloch T. Barbisan C. Chandat L. Sudau A. Siepe I. Rinner O. De Souza N. A Machine Learning-Based Chemoproteomic Approach to Identify Drug Targets and Binding Sites in Complex Proteomes Nat. Commun.202011420010.1038/s 41467-020-18071-x 32826910 PMC 7442650 · doi ↗ · pubmed ↗
8Laurie A.T.R. Jackson R.M. Q-Site Finder: An Energy-Based Method for the Prediction of Protein-Ligand Binding Sites Bioinformatics 2005211908191610.1093/bioinformatics/bti 31515701681 · doi ↗ · pubmed ↗