Mixing features of transcription factors and genes enable accurate prediction of gene regulation relationships for unknown transcription factors
Risa Okubo, Takashi Morikura, Yusuke Hiki, Yuta Tokuoka, Tetsuya J Kobayashi, Takahiro G Yamada, Akira Funahashi

TL;DR
A new deep learning model predicts gene regulation by unknown transcription factors by combining amino acid and nucleotide sequence features.
Contribution
The novel model GReNIMJA can predict gene regulation for unknown transcription factors using mixed sequence features.
Findings
The model achieved 84.4% accuracy for known TFs and 68.5% for unknown TFs.
It outperforms conventional models in predicting regulatory relationships for unknown TFs.
Abstract
Identifying regulatory relationships between transcription factors (TFs) and genes is essential to understand diverse biological phenomena related to gene expression. Recently, deep learning–based models to predict TFs that bind to genes from nucleotide sequences of the target genes have been developed, yet these models are trained to predict known TFs only. Here, we developed a deep learning model, GReNIMJA (Gene Regulatory Network Inference by Mixing and Jointing features of Amino acid and nucleotide sequences), to predict gene regulation even by unknown TFs. Our model is designed to mix the features of the TF amino acid sequences and nucleotide sequences of the target genes using a 2D Long Short-Term Memory architecture and to perform binary classification with the aim of determining the presence or absence of a regulatory relationship. By explicitly modeling interactions between TFs…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5| Accuracy | AUROC | AUPR | F-measure | Precision | Recall | MCC | |
|---|---|---|---|---|---|---|---|
| PWMScan | 0.524 | – | – | 0.191 | 0.595 | 0.114 | 0.065 |
| SVM | 0.581 | 0.666 | 0.645 | 0.410 | 0.693 | 0.291 | 0.136 |
| LR | 0.710 | 0.763 | 0.734 | 0.720 | 0.697 | 0.745 | 0.422 |
| XGBoost | 0.822 | 0.900 | 0.893 | 0.829 | 0.795 |
| 0.646 |
| DeepRAM | 0.739 | 0.809 | 0.784 | 0.748 | 0.729 | 0.768 | 0.479 |
| TBiNet | 0.833 | 0.915 | 0.913 | 0.837 | 0.819 | 0.856 | 0.667 |
| CacPred | 0.687 | 0.635 | 0.564 | 0.530 | 0.813 | 0.641 | 0.214 |
| Ours |
|
|
|
|
| 0.863 |
|
| Accuracy | AUROC | AUPR | |
|---|---|---|---|
| PWMScan | 0.521 | - | - |
| SVM | 0.544 | 0.644 | 0.662 |
| LR | 0.612 | 0.669 | 0.677 |
| XGBoost | 0.674 | 0.766 | 0.764 |
| DeepRAM | 0.639 | 0.690 | 0.663 |
| TBiNet | 0.658 | 0.782 | 0.790 |
| CacPred | 0.697 | 0.619 | 0.586 |
| Ours |
|
|
|
| Accession number | Domain name | Relative frequency of filters |
|---|---|---|
| PS00420 | SRCR domain signature and profile | 0.550 |
| PS00223 | Annexin repeat signature and profile | 0.150 |
| PS00029 | Leucine zipper pattern | 0.142 |
| PS00969 | Antenna complexes alpha and beta subunits signatures | 0.065 |
| PS00266 | Somatotropin, prolactin and related hormones signatures | 0.019 |
| PS00244 | Photosynthetic reaction center proteins signature | 0.015 |
| PS00400 | LBP/BPI/CETP family signature | 0.012 |
| PS00909 | Mandelate racemase/muconate lactonizing enzyme family signatures | 0.012 |
| PS01133 | Uncharacterized protein family UPF0017 signature | 0.008 |
| PS01361 | Zinc finger Dof-type signature | 0.004 |
| PS60026 | Ergtoxin family signature | 0.004 |
| AC | AG | AT | CG | CT | GT |
|---|---|---|---|---|---|
| 0.019 | 0.004 |
| 0.015 | 0.004 | 0.000 |
| Accuracy | AUROC | AUPR | |
|---|---|---|---|
| SVM | 0.548 | 0.600 | 0.616 |
| LR | 0.559 | 0.620 | 0.626 |
| XGBoost | 0.651 | 0.683 | 0.687 |
| Ours |
|
|
|
- —Core Research for Evolutional Science and Technology10.13039/501100003382
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsGenomics and Chromatin Dynamics · Machine Learning in Bioinformatics · Bioinformatics and Genomic Networks
Introduction
Living organisms sustain their lives through the central dogma, where DNA is transcribed into messenger RNA (mRNA), followed by the synthesis of proteins with amino acid sequences corresponding to the mRNA. During transcription, a transcription factor (TF) specifically binds to a transcription regulatory region and regulates gene expression near this region [1]. The complex relationships between TFs and genes form a gene regulatory network (GRN). Such networks are crucial in various biological phenomena, such as development [2, 3], stress responses [4, 5], and cancers [6, 7]. GRNs serve as blueprints for sustaining the biological phenomena, and therefore identifying GRNs is vital for understanding the various biological phenomena.
To identify GRNs, an experimental method called ChIP-seq has been widely used across various species and cell types [8–11]. ChIP-seq comprehensively detects the nucleotide sequences of genes where TFs bind, but it requires specific antibodies for each TF, making it labor-intensive and time-consuming as the TF number increases. To reduce experimental costs, statistical methods that use sequence motifs were developed [12]. The sequence motifs represent the occurrence probability of nucleotide sequences that are bound by specific TFs. Various sequence motifs have been registered in public databases such as JASPAR [13] and TRANSFAC [14]. Using these databases, statistical methods enable us to estimate the gene regulatory relationships for well-studied species like human and mouse. However, sequence motifs as such are insufficient for the accurate prediction of gene regulatory relationships because other factors, such as GC content and DNA shape, influence these relationships [15, 16]. In addition, these methods are fundamentally unable to predict gene regulatory relationships for unknown sequence motifs that have not been detected experimentally.
Deep learning–based discriminative models directly classify specific TFs from nucleotide sequences of the target genes [17, 18]. These models extract features from a nucleotide sequence with consideration of factors such as GC content and DNA shape, and predict the TFs that bind to that sequence [17, 18]. These models can predict the TFs that bind to genes even when the sequence motif they recognize is unknown. However, these models are trained to predict the specific TFs from genes, and they are thus unable to predict gene regulatory relationships with unknown TFs. For identifying the GRNs composed of diverse TFs and genes, a deep learning-based model that can predict gene regulatory relationships even with unknown TFs is strongly required.
In this study we developed a deep learning–based model, GReNIMJA (Gene Regulatory Network Inference by Mixing and Jointing features of Amino acid and nucleotide sequences). A key point of this study is that GReNIMJA was designed, not to predict the specific TFs from genes, but to predict whether the regulatory relationships exist or not from both of the amino acid sequences of TFs and nucleotide sequences of regulatory region of target genes. Our model extracted features from both the TF amino acid sequences and target gene sequences, mixed these features using a 2D Long Short-Term Memory (LSTM) architecture, and performed binary classification to predict the presence or absence of regulatory relationships. The accurate prediction of regulatory relationships even with unknown TFs by our model extends the frontier of the identification of unknown GRNs and may contribute to a deeper understanding of diverse life phenomena.
Materials and methods
Dataset
To construct a dataset of TFs and their target genes, we collected the experimental data stored in the public DoRothEA [19] and Harmonizome [20] database. DoRothEA stores human and mouse gene regulatory relationships that were detected experimentally and statistically; it also groups these relationships by confidence level, which is assigned on a five-point scale ranging from A to E [19]. Regulatory relationships supported by published studies are assigned higher confidence levels, whereas those inferred only from gene expression patterns or sequence motifs are assigned lower confidence levels, reflecting the lack of direct evidence. Harmonizome stores human gene regulatory relationships that were detected experimentally [20]. We collected the gene regulation data and obtained 2674 639 pairs of human TFs and their target genes. We then obtained the TF amino acid sequences and their target gene sequences (upstream 1000 bp of each gene) from the public genome database GenBank [21] (accession number for the human genome: GCF_000001405.39) and constructed pairs of TF amino acid sequences and target gene sequences. We removed pairs that were duplicated in the DoRothEA and Harmonizome databases, contained unidentified residues or bases, or were assigned to the lowest confidence level E in DoRothEA, and obtained 1806 075 pairs (301 915 from DoRothEA and 1504 160 from Harmonizome). The constructed dataset contained 697 TFs. We assigned positive labels to the pairs for which the gene regulatory relationship was indicated as presence in the database (Supplementary Fig. S1), then shuffled the TFs and genes and assigned negative labels to the shuffled pairs for which the gene regulatory relationship was not detected. The same number of nucleotide sequences as the number of genes regulated by TFs was randomly selected from genes other than the regulated genes.
Our gene regulation dataset (3383 776 pairs) consisted of 1806 075 positive pairs and 1577 701 negative pairs; 316 298 pairs were excluded and used as test data. The remaining dataset was divided into training data and validation data (five-fold cross validation). At each fold, the model with the lowest value of the loss function at each epoch was selected, followed by selection of the best model in the whole dataset, and its performance was evaluated using the test data.
TFs with highly similar sequences should not be included in the test data. To clarify the degree of sequence similarity in our dataset, we evaluated sequence similarity between TFs in the training and test datasets. In this analysis, we calculated mean alignment scores using Matrix to Improve Quality in Similarity Search (MIQS) [22] between TF pairs within the training data, within the test data, and between the training and test data. Max identity score is the maximum value of the ratio of matched bases (percent identity) in the alignment between a query sequence and a set of reference sequences; we calculated this score between TFs in the test dataset (query sequences) and in the training dataset (reference sequences).
Embedding of amino acid sequences and nucleotide sequences
To convert amino acid sequences and nucleotide sequences into numerical vector representations, each sequence was converted into a \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} k\end{document} -mer representation (Supplementary Fig. S2A), which was fed to each specifically pre-trained word2vec model [23] for human amino acid sequence and nucleotide sequences(Supplementary Fig. S2B). The sequence length \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} k\end{document} was 4 for amino acid sequences [24] and 5 for nucleotide sequences [25].
Word2vec learns mapping from strings to numerical vectors while considering the semantic information of words within sentences [23]. Word2vec is useful for predicting biological semantics such as protein family attributes and protein–protein interactions [26–28]. To predict words surrounding the target word, we pre-trained word2vec with the skip-gram algorithm. The word2vec architecture is constructed as a neural network with input, middle, and output layers that are fully connected (FC). The softmax function is applied to the output layer, and the parameters of the model are updated by calculating the total cross-entropy loss between the predicted vector and the ground-truth vector of surrounding words. The number of dimensions \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} d\end{document} of the numerical vectors in the output layer was set to 100 for amino acid sequences [24] and 50 for nucleotide sequences [25]. The number of epochs for training the word2vec model was set to 10, and the maximum distance between the predicted surrounding words and the target word was set to 10 words as in the previous studies [28]. We trained two word2vec models, one with amino acid sequences and the other with nucleotide sequences from GenBank [21] (accession number: GCF_000001405.39). These models do not learn the presence or absence of gene regulatory relationships, but just function as pre-trained models that performs embedding, so we trained the embedding model on all of the above sequence data registered in GenBank.
To obtain a numerical matrix, we repeatedly input the sub-sequences of the \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} k\end{document} -mer representation into each pre-trained word2vec model, and combined the numerical vectors from each \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} k\end{document} -mer representation (Supplementary Fig. S2C).
Overview of the proposed model
To accurately predict gene regulatory relationships even with unknown TFs, we developed a deep learning-based model, GReNIMJA, which: (i) extracted each feature from numerical matrices with the amino acid sequences and nucleotide sequences embedded using pre-trained word2vec; and (ii) mixed these features using a 2D LSTM architecture (Fig. 1). It used binary classification to predict the presence or absence of regulatory relationships.
Overview of the proposed model architecture. The amino acid sequence of a TF and the nucleotide sequence of its target gene are converted into \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \end{document}-mer representations. Pre-trained word2vec models are used to embed these representations into numerical matrices. The convolution (Conv) and pooling layers are used to extract the features of this matrix. To align the dimensions of the extracted features, the features are input to the convolution layer with a \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \end{document} kernel. The aligned features are mixed in 2D LSTM. The output from the 2D LSTM is input to the FC layers with sigmoid function. Using the feature output from the FC layers, our model classifies the regulatory relationship between the TF and gene as positive (present) or negative (absent).
(i) Each of the embedded representations was input to the convolution layer and pooling layer with a dropout process that randomly set the parameter values to zero. Because the dimensions of the features extracted from the amino acid sequences and nucleotide sequences were different, we performed convolution processing using a \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} 1 \times 1\end{document} filter for each feature to align the number of dimensions. (ii) The 2D LSTM model enables mixing different series of data; it was used to mix both features [29]. The recursive memory unit structure of LSTM aimed to memorize long sequences by updating internal states; it had an input gate, an output gate, a memory cell to store the state, and an oblivion gate to determine whether to use the features stored in the memory cell. Although LSTM is able to store sequences for long periods, it handles only one kind of series as input. We incorporated 2D LSTM into the proposed model to handle two series as inputs: the TF amino acid sequence and the gene nucleotide sequences (Supplementary Fig. S3). To determine how to mix the two series of sequences, we added a lambda gate to the conventional LSTM and obtained the memory unit structure of 2D LSTM. The features mixed by the 2D LSTM were input into the two FC layers with sigmoid function to predict the presence or absence of the gene regulatory relationship.
Training procedure
The number of epochs was fixed at 100. Mini-batch size was set to 1024 for the training dataset and 512 for the validation dataset. As the loss function, a binary cross-entropy loss was used. Adam [30] was used as the optimizer; the initial learning rate for Adam was set to \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} lr=0.00001\end{document} . For model training, we used NVIDIA V100 32GB PCIe (FP32 14 TFLOPS) and NVIDIA A100 40GB PCIe (FP32 19.5 TFLOPS).
Hyperparameters of GReNIMJA were tuned using Optuna [31], which implemented Bayesian optimization with the Tree-structured Parzen Estimator algorithm. Tuned hyperparameters are described in Supplementary Table S1, and the optimization history is provided in Supplementary Fig. S4. Hyperparameter importance in the Bayesian optimization was calculated using the Mean Decrease Impurity (MDI) evaluator, in which a random forest regression model was trained to predict the objective values (binary cross-entropy loss in the validation dataset). The MDI was then calculated using the trained model. MDI quantifies the total reduction in the impurity for regression contributed by each hyperparameter, accumulated over all splits in the forest where the hyperparameter is used, and averaged across all trees.
Evaluation of model performance for known TFs
Using the test dataset, we evaluated the prediction with the Accuracy, Area Under the Receiver Operating Characteristic Curve (AUROC), Area Under the Precision-Recall curve (AUPR), F-measure, and Matthews Correlation Coefficient (MCC) metrics as follows:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{eqnarray*} \mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN} \end{eqnarray*}\end{document} \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{eqnarray*} \mathrm{AUROC} &=& \int _{0}^{1} f(\mathrm{FPR},\mathrm{TPR}) \ d \ \mathrm{FPR} \;where\; \\\mathrm{FPR} &:=& \frac{FP}{FP+TN}, \mathrm{TPR} := \frac{TP}{TP+FN} \end{eqnarray*}\end{document} \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{eqnarray*} \mathrm{AUPR} &=& \int _{0}^{1} g(\mathrm{Recall},\mathrm{Precision}) \ d \ \mathrm{Recall} \;where\; \\\mathrm{Recall} &:=& \frac{TP}{TP + FN}, \mathrm{Precision} := \frac{TP}{TP + FP} \end{eqnarray*}\end{document} \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{eqnarray*} \mathrm{F-measure} &=& \frac{2\cdot Precision \cdot Recall}{Precision + Recall} \;where\; \\\mathrm{Recall} &:=& \frac{TP}{TP + FN}, \mathrm{Precision} := \frac{TP}{TP + FP} \end{eqnarray*}\end{document} \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{eqnarray*} \mathrm{MCC} = \frac{(TP\times TN) - (FP \times FN)}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \\ \end{eqnarray*}\end{document}where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} TP\end{document} , true positive; \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} TN\end{document} , true negative; \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} FP\end{document} , false positive; \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} FN\end{document} , false negative; FPR, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} FP\end{document} rate; and TPR, \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} TP\end{document} rate. Precision and Recall evaluate the number of false positives and false negatives, respectively. AUROC represents the area of the ROC curve \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} f(\mathrm{FPR},\mathrm{TPR})\end{document} and evaluates the robustness of the model by adjusting the balance between FPR and TPR. The function \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} f\end{document} represents the curve connecting the sample points of FPR and TPR. AUPR represents the area of the PR curve \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} g(\mathrm{Recall},\mathrm{Precision})\end{document} and evaluates the robustness of the model by adjusting the balance between Precision and Recall. The function \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} g\end{document} represents the curve connecting the sample points of Precision and Recall. The F-measure evaluates the balance between false positives and false negatives. MCC evaluates the accuracy of a model in a binary classification task by adjusting the balance between positive and negative cases when the number of one class is too small. When all predictions are wrong, MCC is −1; when predictions are made at random, MCC is 0; and when all predictions are correct, MCC is 1. We calculated the average of the evaluation metrics for all test data as a representative value. For Accuracy, AUROC, and AUPR [32], we also calculated the average of the evaluation metrics for each TF in the test data.
We compared the performance of GReNIMJA with that of a traditional statistical analysis method (PWMScan [12]), classical machine learning methods (Support Vector Machine (SVM) [33], Logistic Regression (LR) [34], and eXtreme Gradient Boosting (XGBoost) [35]), a conventional deep learning method (DeepRAM [32], a multiclass classification model (TBiNet [36]), and a cascade convolutional neural network model (CacPred [37]) on the same dataset.
PWMScan predicts TF binding from motifs in the target nucleotide sequences. The motifs were obtained from the JASPAR database [13]; it stores position frequency matrices (PFMs) that represent the probability of nucleotide sequence binding to a known TF. The PFM was converted into a position probability matrix (PPM). The PPM was converted into a position weight matrix (PWM) that corrected the bias in the occurrence probability of nucleotide sequences across the whole genome (Supplementary Fig. S5).
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{eqnarray*} \mathrm{PWM}_{i, j} = \log _2 \frac{\mathrm{PPM}_{i, j} + c/I}{b_i + c} \end{eqnarray*}\end{document}where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mathrm{PPM}{i, j}\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mathrm{PWM}{i, j}\end{document} represent the occurrence probability of PPM and PWM at each position \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} {i, j}\end{document} . Parameter \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} b_i\end{document} represents the occurrence probability of the nucleotide sequence in the whole genome. Parameter \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} c\end{document} is a constant value introduced for computational stability, and was set to 1 [38]. Parameter \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} I\end{document} represents the sequence length, and was set to 4 for nucleotide sequences and 20 for amino acid sequences. After converting the motif into the PWM, the binding score between PWM and target nucleotide sequence \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} w\end{document} was calculated using the following equation.
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{eqnarray*} \text{binding score} = \sum _{i = 0}^{m-1} \mathrm{PWM} [i , j] \ where \ j = w[i] \end{eqnarray*}\end{document}where \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mathrm{PWM} [i , j]\end{document} is the occurrence probability value at the row index \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} i\end{document} and column index \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} j\end{document} ; the latter corresponds to the nucleotide at \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} i\end{document} in the nucleotide sequence \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} w\end{document} . Parameter \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} m\end{document} is the number of rows in the PWM. When the binding score exceeds a predefined threshold, PWMScan predicts the presence of TF binding to the target nucleotide sequence. The threshold was determined by the significance level (P-value), which was predefined on the basis of the binding score distribution. Careful selection of this threshold is required to balance false-positive and false-negative predictions. In this study, the P-value was optimized by Bayesian optimization in SigOpt (https://sigopt.com). For PWMScan performance evaluation, TFs not included in the JASPAR database were excluded from the test dataset.
SVM learns a binary classification to maximize the distance between the decision boundary and each data point [33]. Generalized linear model LR follows a Bernoulli distribution and learns a binary classification based on the probability that a data point belongs to a specific class [34]. XGBoost (a gradient boosting decision tree model) is an ensemble model of decision trees that learns a binary classification [35]. In these models, the length of the input numerical vector must be fixed. In accordance with [39], we fixed the length of input numerical vector by obtaining the weighted average of each numerical vector that was converted from \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} k\end{document} -mer representation by using the pre-trained word2vec model. Values of term frequency-inverse document frequency consider the importance of each \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} k\end{document} -mer [39] and were used as weights for averaging.
DeepRAM [32] classifies whether a specific TF is a regulatory factor or not for the input nucleotide sequences. DeepRAM architecture consists of a convolution layer and LSTM. Because DeepRAM learns a binary classification task, training multiple models for each TF is required; because of computational constraints, we randomly selected 44 TFs from the 697 TFs in the dataset, and trained each model for each TF. TBiNet [36] predicts the presence or absence of the gene regulatory relationship for each TF from the input nucleotide sequences. TBiNet architecture consists of convolution layers, LSTM, and an attention layer. Because TBiNet learns a multi-class classification task, TBiNet cannot predict TFs outside of the training dataset. We trained TBiNet on TFs from the test dataset. The hyperparameters for each TF in each DeepRAM model were optimized by using Bayesian optimization with the Tree-structured Parzen Estimator algorithm described in [32]. We used the hyperparameters for TBiNet described in [36]. CacPred [37] takes forward- and reverse-complementary gene sequences as input and predicts the probability of binding to a target TF. Its architecture consists of a convolution layer followed by a dense layer. Similar to DeepRAM and TBiNet, CacPred requires training a separate model for each TF.
Evaluation of model performance for unknown TFs
In this study, we defined the unknown TFs as those that were not included in the training dataset. To ensure that TFs in the test dataset are completely unseen to the trained model, we carefully constructed the datasets so that no TFs overlapped among the training, validation, and test datasets. We first randomly selected a subset of TFs from the full set and removed in advance all TF–gene pairs associated with them that were used as test data. The remaining subset of TFs was further partitioned, and the TF–gene pairs in each subset were assigned to the training and validation datasets. We divided 697 TFs as follows: training, 597; validation, 30; test, 70. We trained the model with 50 epochs and used the Accuracy, AUROC, and AUPR metrics to evaluate its performance. We also compared the performance of our model with that of classical machine learning methods using the same datasets. A traditional statistical analysis method and deep learning methods were excluded from the comparison because they cannot predict regulatory relationships for unknown TFs.
Evaluation of model robustness
The performance of deep learning models generally depends on the number of datapoints; when this number is small, the performance is often degraded [40, 41]. To evaluate the robustness of our model for the number of datapoints for TFs, we analyzed the accuracy of prediction for that number.
The performance of LSTM can also be degraded for extremely long sequences [42]. Although most TFs in the dataset had around 500 amino acids, the longest one had 4005 (Supplementary Fig. S6). We analyzed the accuracy of prediction for each sequence length. To quantify the effect of the length of the TF amino acid sequence on prediction bias, we analyzed the true positive, true negative, false positive, and false negative predictions by our model for each sequence length.
Feature analysis of the trained model
To interpret GReNIMJA, we analyzed features in (i) the convolution layer and (ii) the last output layer of the model. (i) We analyzed the activated amino acid sequences and nucleotide sequences in our model. The analysis algorithm was based on that used in DeepBind [17]. Briefly, all sequences in the training dataset were input into the convolution layer, and the feature vectors \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} u\end{document} were obtained. By adding all features and subtracting the mean value of the feature vector \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \mathrm{u}{mean}\end{document} , we obtained the centered feature vector \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} v := u - \mathrm{u}{mean}\end{document} . The reference value Z was calculated as follows:
\documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \begin{eqnarray*} \mathrm{Z} = 0.6 \times max(v) + \mathrm{u}_{\mathrm{ mean}} \end{eqnarray*}\end{document}An array corresponding to the position \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} > \mathrm{Z}\end{document} was extracted as the active array. PWM was calculated for the active array to obtain the weighted occurrence probability of the sequence elements. To analyze amino acid sequence activation, we searched for the most similar motif by comparing the calculated PWM with the PWM stored in the PROSITE database [43]. For motif searching, we used the Tomtom database [44]. To analyze nucleotide sequence activation, we calculated the total sum of the corrected occurrence probabilities for each nucleotide in the PWM of each filter and chose the top two nucleotides with the highest total values. The number of the obtained nucleotides was counted, and the occurrence ratio of the nucleotides was calculated. (ii) We reduced 140 dimensions of the features output by the model using the test dataset to two dimensions using t-Distributed Stochastic Neighbor Embedding (t-SNE) [45]. By plotting the reduced features in a 2D space, we qualitatively evaluated the features of the decision boundary regarding the presence or absence of gene regulatory relationships.
To evaluate whether the mixture of features contributed to the prediction performance, the decision boundaries for the presence or absence of the gene regulatory relationship were visualized (Supplementary Fig. S7A). To improve visibility, we visualized each plot in the dimensionality-reduced space for 10 TFs and 100 target genes selected randomly from the test dataset. If the features were not mixed by the model, all dots should be concentrated at the same point without crossing the decision boundary (Supplementary Fig. S7B). If the features were mixed, all dots should be distributed across the decision boundary (Supplementary Fig. S7C).
Our model attempts to predict gene regulatory relationships for unknown TFs by using feature similarity to the learned sequence. To evaluate whether the TF features learned by the model reflect the similarity between TFs, we analyzed sequence similarity between a TF and its neighboring TFs on the plot in the dimensionality-reduced space using Basic Local Alignment Search Tool (BLAST) [46]. To simplify the analysis, we selected the TFs with a prediction accuracy of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \ge\end{document} 20% higher by our model than those of conventional deep learning models. We visualized the relationship between the number of datapoints and the proportion of neighboring TFs homologous to the TF of interest. We also analyzed sequence similarity between the unknown TFs present only in the test dataset and their neighboring TFs, and calculated the percentage of TFs in the neighborhood homologous to the unknown TF.
Visualization of GRNs using adjacency matrices
To qualitatively evaluate GRNs containing each predicted gene regulatory relationship using each model, the adjacency matrices were visualized. Hierarchical clustering using Ward’s method was performed for each TF. Sematic clustering was then performed for each gene. In semantic clustering, the genes with CpG promoters, which tend to be bound by multiple TFs [47, 48], and the genes without CpG promoters were clustered separately. To annotate CpG promoters, CpG islands within the 1000-bp upstream region of each target gene were retrieved from the UCSC Genome Browser (’cpgIslandExtUnmasked’ table) using rtracklayer [49]. If the predicted gene regulatory relationship patterns differed between genes with CpG promoters, which affect various gene regulation relationships [50], and those without CpG promoters, this indicated that the model likely learned only a biologically trivial rule based on whether a gene contains a CpG promoter. If the predicted patterns were similar between these two gene groups, this suggested that the model learned interactions between TFs and genes, rather than relying solely on CpG promoter presence.
Results
Proposed model outperformed conventional models in predicting gene regulatory relationships for known TFs
The proposed model GReNIMJA outperformed conventional models in almost all evaluation metrics (Table 1). Its highest accuracy (0.844) indicated that the model was the most accurate at predicting gene regulatory relationships. The highest F-measure (0.847) indicated the best balance between false negatives and false positives. The highest AUROC (0.928) and AUPR (0.929) indicated that the model was more robust than the conventional models. The MCC value of GReNIMJA was also the highest (0.689); MCC evaluates precision while adjusting the balance of the number of datapoints. Because the number of datapoints in the constructed dataset was larger for positive data than for negative data, GReNIMJA had high precision in predicting not only positive but also negative data. In summary, GReNIMJA outperformed conventional models in terms of prediction accuracy, balance between false negatives and false positives, robustness, and precision.
The number of datapoints for TFs tended to decrease as the sequence length increased (Supplementary Fig. S6). To evaluate model performance for TFs of different length, we analyzed the accuracy, AUROC, and AUPR for each TF. GReNIMJA outperformed the conventional models in all evaluation metrics for each TF (Table 2). Accuracy, AUROC, and AUPR of the prediction by GReNIMJA for all TFs were 0.011, 0.013, and 0.016 points higher, respectively, than those of TBiNet, which showed the highest performance among the conventional models (Table 1). For each TF, these metrics were 0.093, 0.058, and 0.049 points higher in GReNIMJA than in TBiNet (Table 2). Thus, the degree of performance improvement relative to the conventional models was higher for each TF than for all TFs. These findings and the distribution of the sequence lengths (Supplementary Fig. S6) suggested that GReNIMJA predicts the gene regulatory relationships more accurately and robustly even for the TFs with few datapoints and long sequences. Thus, GReNIMJA would be useful for the identification of regulatory relationships for various TFs and genes.
Because the choice of the embedding algorithm is also crucial for model performance, we also performed a performance evaluation using one-hot encoding as an embedding baseline and found that word2vec consistently outperformed one-hot encoding (Supplementary Table S2). In comparison with discrete one-hot representations, continuous latent representations learned by word2vec have more degrees of freedom, which may have contributed to the performance improvement.
Analysis of hyperparameter importance in our model revealed that the number of channels in the convolution layer in the feature extraction section for amino acid sequences was the most sensitive parameter to the model performance, followed by the number of channels in the 2D LSTM architecture and that for nucleotide sequences, whereas kernel strides in the convolution layers for both nucleotide and amino acid sequences were the most robust parameters (Supplementary Fig. S8). These findings suggest that hyperparameters related to the degree of freedom in the model, such as the number of channels, tend to have a greater impact on performance than those controlling the resolution of feature extraction in amino acid or nucleotide sequences, such as kernel strides.
Analysis of sequence similarity between TFs in the training and test datasets revealed that the mean alignment score of TF pairs from both the training and test datasets (−0.2362) was lower than those of TF pairs within the training dataset (0.04425) or the test dataset (0.06774) (Supplementary Fig. S9A). The median max identity score between TFs from the test and training datasets was 61.36% (Supplementary Fig. S9B). The sequence similarity between the training and test datasets was substantially lower than that within each dataset. This indicated that our dataset has low sequence similarity and the risk of data leakage is minimal.
Proposed model was robust against the small number of datapoints and long sequences of TFs
The relationship between the number of datapoints and the prediction accuracy of our model is shown in Fig. 2. Prediction accuracy tended to deteriorate with the decrease in the number of datapoints, but remained higher than that of conventional models at a number of datapoints <1000 (Fig. 2A). In addition, the accuracy of our model was not dependent on sequence length (Fig. 2B). The proportion of positive (true or false) and negative (true or false) predictions was similar regardless of sequence length (Fig. 2C). This suggests that sequence length do not affect the bias of our model. Overall, GReNIMJA was able to robustly predict gene regulatory relationships even for the TFs with a small number of datapoints and long sequences.
Effect of the number of datapoints and TF length on model performance. (A) The horizontal axis represents the number of TFs in the training dataset, and the vertical axis represents the accuracy for each TF in the test dataset. The number of TFs was calculated as the median of the number of datapoints included in each fold of the training dataset. The area surrounded by the dotted line represents the region where the performance of our model exceeded that of the conventional model for a TF with < 1000 datapoints. (B) Average accuracy according to TF sequence length. The groups of sequence lengths were constructed in increments of 50 residues. (C) The relationship between TF length and TF frequency predicted in the indicated categories. The groups of sequence lengths were constructed in increments of 50 residues.
Proposed model acquired biological context related to gene regulatory relationships
To interpret the basis of the predictions by our model, we analyzed the features of the convolution and last output layers. In the convolution layer, we found that a total of 11 types of TF domains were activated, including leucine zippers and zinc fingers (Table 3) In nucleotide sequences, our model focused on AT content (Table 4). In the last output layers, a dimensionality reduction analysis by t-SNE showed that the positive and negative data were concentrated in the same region and formed clusters (Supplementary Fig. S10). Our model might improve performance as a result of learning a decision boundary that forms easily separable clusters. It is noteworthy that it learned the decision boundary of a gene regulatory relationship by using the biological domain context as a basis for the prediction, even though the model was not explicitly given any biological knowledge that was important for gene regulatory relationships.
To evaluate the effect of feature mixing on model performance, we visualized the relationship between the decision boundary and feature plots that randomly selected from the test dataset in the dimensionality-reduced space. If the features were not mixed, the embedded numerical vectors of amino acid sequences and nucleotide sequences were aggregated in the same region, but if the features were mixed, each feature of TF or gene was distributed across the decision boundary. The feature dots of the TFs tended to form clusters across the decision boundary (Fig. 3A), whereas the feature dots of the genes were sparsely scattered across the decision boundary (Fig. 3B). This suggests that our model considers the interaction between TFs and genes when predicting the regulatory relationships between them, in agreement with biological knowledge.
Effect of feature mixing on the prediction of gene regulatory relationships by our model. Dimension reduction of the features obtained from the output layer using t-SNE was performed. Each dot represents features corresponding to each TF and gene. Left panels, randomly selected (A) TFs or (B) genes. Dots of different colors represent different TFs or genes; light blue, TFs or genes that were not selected. Right, prediction results for the gene regulatory relationships for each feature. Blue, features predicted as positive; red, features predicted as negative. Detailed information about the approaches used is provided in the ‘Materials and methods’ section.
We designed our model to predict gene regulatory relationships by using feature similarity to the learned sequence even if the number of datapoints for a TF is small and this TF has not been learned. To evaluate whether the learned features of a TF preserve the TF characteristics, we analyzed sequence similarity between a TF and its neighboring TFs on the plot in the dimensionality-reduced space. For TFs with few datapoints, neighboring TFs had high similarity to the TF of interest (Fig. 4A). For unknown TFs, neighboring TFs had high similarity at a rate of 10%–20% for the TF of interest (Fig. 4B). These results indicated that our model performed the prediction by using features similar to those of the learned TFs.
Homology analysis. (A) Relationships between the number of datapoints for the TF and the percentage of neighboring TFs homologous to the TF of interest in the training dataset. The TFs of interest were selected from among those with >1000 datapoints and those with the gene regulatory relationships predicted with an accuracy of \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \end{document}20% higher than that of TBiNet. Each dot represents one TF. (B) Proportion of neighboring TFs found to be homologous to the unknown TF of interest. The IDs of unknown TFs were assigned randomly.
Overall, our model achieved highly accurate predictions of gene regulatory relationships by considering the interactions between TFs and genes, focusing on the DNA-binding domains of the TFs and AT content of the genes, and using features similar to the learned TFs.
Our model achieved high-precision predictions even for unknown TFs
Our model accurately predicted gene regulatory relationships for unknown TFs (Table 5). In all evaluation metrics, its performance was higher than that of those traditional statistical analysis methods and baseline machine learning methods (Table 5). The highest accuracy of our model (0.685) indicated that it most accurately predicted gene regulatory relationships. The highest AUROC (0.698) and AUPR (0.694) of our model indicated that it can most robustly estimate gene regulation relationships. Because our model can infer gene regulatory relationships accurately and robustly for unknown TFs, it can be useful for the identification of diverse gene regulatory relationships, in particular for TFs that have not even been detected experimentally.
To qualitatively evaluate GRNs based on prediction by each model, adjacency matrices were visualized (Fig. 5). Among the genes examined, 12 800 contained CpG promoters; 12 374 did not; and 1729 genes could not be annotated because of the absence of corresponding database entries. The clustering results showed that the prediction patterns for the baseline models (logistic regression, SVM, and XGBoost) differed markedly between genes with and without CpG promoters, suggesting that these models relied on the presence or absence of CpG promoters. In contrast, the prediction patterns for our model remained largely consistent regardless of CpG promoter status. Taken together with the results shown in Fig. 3, these findings suggest that our model makes predictions by learning TF–gene interactions, rather than by relying solely on CpG promoter content.
Visualization of adjacency matrices for each unknown TF in the prediction of regulatory relationships. Gene regulatory relationships are shown in the figure key; NA, relationships that are not yet annotated. The vertical axes represent the indices of TFs after hierarchical clustering, while the horizontal axis represents the indices of genes after semantic clustering, as for the presence or absence of CpG promoters.
Discussion
In this study, we developed a deep learning–based model, GReNIMJA, that is able to predict gene regulatory relationships for unknown TFs (Fig. 1). GReNIMJA did for known TFs so better than did the conventional models did (Tables 1 and 2). The model was robust to the number of datapoints and TF sequence length (Fig. 2). By analyzing the features of GReNIMJA, we revealed that it predicted gene regulatory relationships by using mixed information on TFs and target genes (Fig. 3). As a basis for the prediction, GReNIMJA learned the biological domain context, namely the DNA-binding domains in the TFs and the AT content of the target genes (Tables 3 and 4). Homology analysis revealed high similarity between TFs that had similar features learned by GReNIMJA (Fig. 4). The model might have learned latent representations that reflect the homology among TFs. By leveraging the latent representations of similar TFs used for pre-training, the model was able to predict gene regulatory relationships with high accuracy even for unknown TFs. Consistent with this result, we showed that GReNIMJA predicted the regulatory relationships of unknown TFs with high performance (Table 5 and Fig. 5). Our framework has the potential to identify invariant latent representations that are essential for prediction of gene regulatory relationships of unknown TFs. In contrast, baseline machine-learning models, such as XGBoost, can directly learn decision boundaries to classify the presence or absence of gene regulatory relationships, but cannot inherently acquire such invariant latent representations. The most important contribution of our work is not the improvement of prediction performance using a complex deep learning model, but the establishment of a new methodological framework that can learn invariant latent representations of gene regulatory relationships, enabling prediction of gene regulation for unknown TFs. This task is central to the identification of GRNs and, to the best of our knowledge, it has not previously been successfully completed. Prediction of gene regulatory relationships requires prediction of not only the binding between TFs and genes, but also whether the TFs regulate these genes or not [51]. Although conventional deep learning–based models are able to accurately predict whether or not a specific TF and gene will bind to each other [17, 18], GReNIMJA can also predict the presence or absence of gene regulatory relationships on the basis of biological domain knowledge. This fundamental difference might have contributed to the improved performance.
We found that PS00029 (leucine zipper), which is DNA-binding domains [52], was activated in the convolution layer (Table 3), The scavenger receptor cysteine-rich (SRCR) domain was also activated. This domain is present in cell-surface and secreted proteins mainly related to the immune system [53], and it mediates protein–protein interactions [54]. As the binding specificity of TFs can be changed through interactions with other proteins [55], the SRCR domain might affect gene regulatory relationships through such interactions. Together, these findings suggest that the performance of GReNIMJA was improved by its ability to recognize the DNA-binding domains, which are important in gene regulation, and the SRCR domain, which affects gene regulatory relationships through protein–protein interactions. However, plant-specific domains such as PS01361 (zinc finger) and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \alpha\end{document} and \documentclass[12pt]{minimal} \usepackage{amsmath} \usepackage{wasysym} \usepackage{amsfonts} \usepackage{amssymb} \usepackage{amsbsy} \usepackage{upgreek} \usepackage{mathrsfs} \setlength{\oddsidemargin}{-69pt} \begin{document} \beta\end{document} subunits of antenna complexes were also activated in the convolution layers. These results are biologically incorrect and therefore represent false-positive signals. Consequently, it is essential to interpret activation patterns in the context of established biological knowledge.
We found that ATs in nucleotide sequences were activated in the convolution layer (Table 4). Analysis of the nucleotide content at binding sites and nonbinding sites of TFs has revealed that TFs have binding sequence preferences; some TF families tend to bind to AT-rich or GC-rich sequences [56]. These considerations suggest that the performance of GReNIMJA was improved by its ability to recognize whether the gene nucleotide sequence was AT-rich or not. Promoters have odd sequence characteristics shaped by evolution, such as GC richness, that do not directly affect gene regulation [57]. Therefore, we should note that not only first-order sequence statistics, such as GC richness, but also higher-order interactions between gene and TFs may influence gene regulatory relationships. Indeed, as shown in Fig. 3, our model appears to learn these interactions to predict the TF–gene relationships. Although directly interpreting such higher-order interactions in deep learning models is inherently challenging due to their high dimensionality, further investigation into these interactions may provide insights that could improve model performance.
Predictions by GReNIMJA were based only on sequence information, but gene regulatory relationships are also influenced by the 3D structure of DNA and geometric variations in TF structures [58], as well as hydrophobicity [59]. These biophysical properties that govern higher-order functions related to DNA and proteins represent fundamental mechanisms shared across organisms [60]. The biophysical properties are encoded by nucleotide sequences in the regulatory regions of genes and by TF amino acid sequences [61–63]. By incorporating the 2D LSTM module, which explicitly models interactions between the TF and gene regulatory sequences, GReNIMJA may indirectly capture these biophysical features of interactions. We therefore speculate that this characteristic of our model enabled versatile prediction of the gene-regulatory relationships of unknown TFs. On the other hand, our model cannot handle multi-omics data such as epigenetic or spatial structural information data. A gene regulatory relationship is affected not only by the interaction between the TF and regulatory gene regions, but also by epigenetic or spatial properties [64–67]. Incorporating the thermodynamic properties of DNA improves the performance of gene regulation prediction [68]; therefore, explicit integration of such biophysical properties and multi-omics data in GReNIMJA might be useful to predict gene regulatory relationships more generally. The generalizability of our approach could be further improved through adaptive approaches (such as few-shot learning or transfer learning) so that it could be extended to other species, such as mice, rats, and dogs. By generalizing GReNIMJA, we may be able to obtain gene regulatory relationships as blueprints underlying biological phenomena, paving the way for a deeper understanding of diverse life processes.
PWMScan has long been used as a powerful tool for elucidating gene regulatory relationships [12]. Because PWMScan is grounded in statistical knowledge of sequence motifs, its predictions are highly interpretable, one of its major strengths. However, PWMScan uses a predefined threshold to predict TF–DNA binding and therefore bears an inherent risk of imbalances between false negatives or false positives. PWMScan had a high rate of false negatives, whereas the other deep learning models tested achieved a better balance between Recall and Precision, resulting in a higher prediction performance (Tables 1 and 2). Although we attempted to interpret the prediction biases of our trained model as thoroughly as possible (Tables 3 and 4, and Figs 4 and 5), the trade-off between prediction performance and interpretability remains a limitation of this study.
Regarding the embedding algorithm, we showed that word2vec consistently outperformed one-hot encoding in our model (Supplementary Table S2). However, different embedding algorithms result in different embedding dimensions. Since there are 20 amino acids and 4 nucleotides, the number of dimensions of one-hot vectors is fixed at 20 for TFs and at 4 for genes. In this study, following prior work [24, 25], the dimensions of word2vec embeddings were set to 100 for TFs and 50 for genes. Differences in the number of embedding dimensions inevitably affect the number of model parameters. Therefore, both the number of embedding dimensions and the resulting number of model parameters may have affected model performance.
We should treat the negative TF–gene pairs in our dataset with appropriate caution. In our customized dataset, we implicitly assumed that all unobserved interactions are truly negative, but some TF–gene pairs labeled as negative might be true positives, as comprehensive experimental validation is currently lacking. Thus, performance evaluation in this study may be biased by false negatives. Ideally, the definition of negative TF–gene pairs should be based on the absence of regulatory interactions in knockdown or knockout experiments or reporter assays. Ascertaining the absence of interactions by using comprehensive and multifaceted experimental evaluation is strongly required. This bias of model performance evaluation is a technical limitation of this study.
Results of the sequence similarity between TFs in the training and test datasets suggested that our dataset has low sequence similarity and thus the risk of data leakage is minimal. Even if the dataset contained data leakage, other baseline models, trained and evaluated on the same dataset, would also be expected to achieve an AUPR of ∼93% because of the data leakage. In our performance evaluation (Table 1), the average AUPR of the baseline models was ∼79%. The AUPR of 93% of our model does not suggest data leakage directly, but rather demonstrates that our model is genuinely more effective than the baseline models. Because TFs consist of 20 amino acids and genes consist of 4 nucleotides, it is impossible to construct a ‘perfect’ dataset in which all sequences are dissimilar. Constructing training, validation, and test sets that completely eliminate sequence identity would require the model to predict patterns that are entirely unknown. From the perspective of deep learning, this task requires an extreme extrapolation, which is highly challenging or currently infeasible [69–71]. This limitation is not unique to our model but is a long-term fundamental challenge in the field of deep learning based on biological data.
Databases such as DoRothEA and Harmonizome include gene regulatory pairs identified through statistical inference and may have a false-positive bias. They also include pairs for which binding was detected by ChIP-seq only. However, determining whether the gene regulatory relationship exists requires evidence not only for physical binding but also for the actual regulation of gene expression, such as RNA-seq. Because of these limitations, our dataset may also suffer from false-negative bias. Overall, we consider that false-positive and false-negative biases might be caused by database limitations. A detailed analysis of these biases would require evaluating all TF pairs using experimental approaches such as RNA-seq.
Whereas this study focused on gene–TF pairs, an important future direction would be extending our approach to learning GRNs using a graph-based approach, such as graph neural networks. Graph-based approaches are powerful for predicting many-to-many gene regulatory relationships among multiple nodes [72, 73] and are essential for efficient GRN inference. Because our method predicts the gene regulatory relationship within one pair of nodes (a minimal unit of a GRN), it could be incorporated into graph-based frameworks to improve GRN prediction performance.
In summary, GReNIMJA predicts regulatory relationships between TFs and genes with high accuracy and robustness, even for unknown TFs. Considering its performance for unknown TFs, the proposed model could also be applied to TFs whose binding properties and roles in gene regulation have not yet been characterized experimentally. We expect this model to contribute to a deeper understanding of biological phenomena by accurately identifying complex gene regulatory relationships that have not yet been fully elucidated.
Supplementary Material
lqag022_Supplemental_File
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Lambert SA, Jolma A, Campitelli LF et al. The human transcription factors. Cell. 2018;172:650–65. 10.1016/j.cell.2018.01.029.29425488 PMC 12908702 · doi ↗ · pubmed ↗
- 2Hinman VF, Nguyen AT, Cameron RA et al. Developmental gene regulatory network architecture across 500 million years of echinoderm evolution. Proc Natl Acad Sci USA. 2003;100:13356–61. 10.1073/pnas.2235868100.14595011 PMC 263818 · doi ↗ · pubmed ↗
- 3Boiani M, Schöler HR. Regulatory networks in embryo-derived pluripotent stem cells. Nat Rev Mol Cell Biol. 2005;6:872–81. 10.1038/nrm 1744.16227977 · doi ↗ · pubmed ↗
- 4Taylor-Teeples M, Lin L, De Lucas M et al. An Arabidopsis gene regulatory network for secondary cell wall synthesis. Nature. 2015;517:571–5. 10.1038/nature 14099.25533953 PMC 4333722 · doi ↗ · pubmed ↗
- 5Yamada TG, Hiki Y, Hiroi NF et al. Identification of a master transcription factor and a regulatory mechanism for desiccation tolerance in the anhydrobiotic cell line Pv 11. P Lo S One. 2020;15:e 0230218. 10.1371/journal.pone.0230218.32191739 PMC 7082025 · doi ↗ · pubmed ↗
- 6Carro MS, Lim WK, Alvarez MJ et al. The transcriptional network for mesenchymal transformation of brain tumours. Nature. 2010;463:318–25. 10.1038/nature 08712.20032975 PMC 4011561 · doi ↗ · pubmed ↗
- 7Sharma A, Halu A, Decano JL et al. Controllability in an islet specific regulatory network identifies the transcriptional factor NFATC 4, which regulates type 2 diabetes associated genes. NPJ Syst Biol Appl. 2018;4:25. 10.1038/s 41540-018-0057-0.29977601 PMC 6028434 · doi ↗ · pubmed ↗
- 8Johnson DS, Mortazavi A, Myers RM et al. Genome-wide mapping of in vivo protein–DNA interactions. Science. 2007;316:1497–502. 10.1126/science.1141319.17540862 · doi ↗ · pubmed ↗
