Proteome-wide prediction of the mode of inheritance and molecular mechanisms underlying genetic diseases using structural interactomics
Ali Saadat, Jacques Fellay

TL;DR
This paper presents a new method using protein networks and structures to predict how genetic diseases are inherited and their molecular causes.
Contribution
A novel graph-of-graphs approach combining structural interactomics and graph neural networks for proteome-wide disease inheritance prediction.
Findings
The method predicts mode of inheritance and functional effects of genetic variants across autosomal proteins.
Feature attribution provides biological insights into disease mechanisms.
Proteome-wide predictions are publicly available for broad use.
Abstract
Genetic diseases can be classified according to their modes of inheritance and their underlying molecular mechanisms. Autosomal dominant disorders often result from DNA variants that cause loss-of-function, gain-of-function, or dominant-negative effects, while autosomal recessive diseases are primarily linked to loss-of-function variants. In this study, we introduce a graph-of-graphs approach that leverages protein-protein interaction networks and high-resolution protein structures to predict the mode of inheritance of diseases caused by variants in autosomal genes and to classify dominant-associated proteins based on their functional effect. Our approach integrates graph neural networks, structural interactomics, and topological network features to provide proteome-wide predictions, thus offering a scalable method for understanding genetic disease mechanisms. •A graph-of-graphs…
Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.
Click any figure to enlarge with its caption.
Figure 1
Figure 2
Figure 3
Figure 4
Figure 5
Figure 6
Figure 7
Figure 8Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsBioinformatics and Genomic Networks
Introduction
Most human genetic diseases result from variants that disrupt protein function through diverse molecular mechanisms, which play a critical role in determining their mode of inheritance (MOI).1 In autosomal dominant (AD) disorders, a single copy of a mutated gene can result in disease, often through loss of function (LOF) due to haploinsufficiency (HI), where the remaining wild-type allele cannot compensate for the lost function.2 Dominant disorders can also result from non-LOF mechanisms, such as gain of function (GOF), where the mutant protein acquires a new or altered function, and the dominant-negative (DN) effect, where the mutant protein interferes with the normal function of the wild-type protein.3 In contrast, autosomal recessive (AR) disorders require variants in both gene copies, predominantly involving LOF mechanisms, such as missense variants that destabilize protein structure or nonsense variants leading to truncated, non-functional proteins.
Previous studies on MOI prediction have introduced computational tools such as DOMINO,4 which utilizes linear discriminant analysis (LDA) to predict whether a protein is associated with AD disorders by integrating various features such as genomic data, conservation, and protein interactions. MOI-Pred,5 on the other hand, focuses on variant-level predictions, specifically targeting missense variants associated with AR diseases.
More recent research has aimed at predicting the functional impact of variants in specific genes. LoGoFunc combines gene-, protein-, and variant-level features to predict pathogenic GOF, LOF, and neutral variants.6 Another study explored the structural effects of variants, finding that non-LOF variants tend to have milder impacts on protein structure.7 Additionally, a recent study employed three support vector machines (SVMs) to predict protein coding genes associated with DN, GOF, and HI mechanisms.8
In this study, we present a comprehensive approach for predicting the MOI for all proteins encoded by autosomal genes, as well as elucidating the functional effect of variants underlying AD genetic disorders (Figure 1). Our framework combines graph neural networks (GNNs)9 with structural interactomics by creating a graph of graphs,10 utilizing both protein-protein interaction (PPI) network and high-resolution protein structures. For MOI prediction, we model proteins as nodes within the PPI network, incorporating topological and protein-level features for classification. For molecular mechanism prediction, we represent each protein as a graph of amino acid residues, leveraging structure-based features to classify the functional effect as HI, GOF, or DN. This integrated approach enables proteome-wide prediction of inheritance patterns and provides mechanistic insights into AD diseases, offering a novel, scalable framework for understanding genetic disorders.Figure 1. Overview of the studyThe mode of inheritance (MOI) is first predicted for all autosomal proteins in the protein-protein interaction network. Next, AlphaFold-predicted protein structures are used to generate residue-level graphs for each dominant protein, which are then used to predict functional effects. Figure created with https://www.biorender.com.
For the sake of flow and conciseness, we refer to “proteins associated with AD disorders” as AD proteins and “proteins associated with AR disorders” as AR proteins. Similarly, we use DN (GOF/LOF) proteins instead of “proteins associated with DN (GOF/LOF) molecular disease mechanisms.”
Results
Datasets
MOI data
We gathered 4,737 MOI-labeled proteins; among them, 2,494 (53%) were only AR, 1,420 (30%) were only AD, and 808 (17%) were both AD and AR (Figure 2, left).Figure 2. Summary of MOI dataLeft panel shows the number of proteins with annotated MOI. Right panel displays the number of proteins with annotations regarding their molecular mechanism.
Functional effect data
We collected 1,276 proteins with annotated functional effect; among them, 250 (20%) were only DN, 376 (29%) were only HI, 251 (20%) were only GOF, 114 (9%) were both DN and HI, 115 (9%) were both DN and GOF, 92 (7%) were both HI and GOF, and 78 (6%) were all of the DN, HI, GOF (Figure 2, right).
PPI construction and annotation
We constructed a comprehensive PPI network comprising 17,248 nodes and 375,494 edges by integrating interactions from STRINGdb (search tool for the retrieval of interacting proteins database),11 BioGRID (biological general repository for interaction datasets),12 the Human Reference Interactome (HuRI),13 and Menche et al.14 To characterize proteins, we annotated them with 78 selected features covering structural, functional, evolutionary, and regulatory properties (Table S1).
Protein graph construction and annotation
We obtained predicted protein structures from the AlphaFold database15 and constructed residue-level graphs using Graphein.16 In these graphs, nodes represent amino acids, while edges capture peptide bonds, hydrogen bonds, disulfide bonds, ionic interactions, and other structural contacts, including long-range interactions. We annotated amino acid residues with 73 selected features reflecting structural, sequence-based, biochemical, and evolutionary characteristics (Table S2).
Model development
Study design
We formulated MOI prediction as a node classification task within the PPI network and functional effect prediction as a graph classification task. Both models employed a multi-label classification approach, allowing each input to have multiple labels. We evaluated various GNN architectures, including graph convolutional networks (GCNs),17 graph attention networks (GATs),18 and graph isomorphism networks (GINs).19
Data splitting
To construct training, validation, and test sets, we clustered human protein sequences using MMseqs220 with stringent thresholds of 20% sequence identity and 20% alignment coverage. This conservative cutoff minimizes sequence similarity between splits, thereby reducing the risk of information leakage and encouraging the model to generalize. Protein clusters were then assigned to the training (80%), validation (10%), or test (10%) set.
Hyperparameter tuning and model training
All models used a single hidden layer, with the output layer containing two units for MOI prediction (AD and AR) and three units for functional effect prediction (DN, HI, and GOF). To tune the hyperparameters, we evaluated 25 configurations on the validation set by varying the hidden layer size across five values (128, 64, 32, 16, and 8) and the learning rate across five values ranging from to . The results of hyperparameter tuning for MOI and functional effect prediction are provided in Tables S3 and S4. Using the selected hyperparameters, we trained each model with binary cross-entropy loss for up to 100 epochs, applying early stopping based on validation loss to prevent overfitting.
Model performance evaluation
MOI models
We evaluated all trained models on the unseen test set (Table 1). The GCN model achieved the highest precision score, while the GAT model had the best recall and score. Due to the class imbalance in the MOI dataset, we prioritized maximizing score and selected the GAT model. We also assessed the performance of DOMINO4 as outlined in the STAR Methods section and found that our models outperformed it.Table 1MOI prediction performance on the test setMetricGCNGATGINLDA4F10.7450.7500.6710.685Precision0.7760.7700.7640.721Recall0.7250.7310.6210.654GCN, graph convolutional network; GAT, graph attention network; GIN, graph isomorphism network; LDA, linear discriminant analysis.
Functional effect models
Table 2 shows the performance of various models on the functional effect test set, with the GCN model achieving the highest score. We also evaluated the SVM models from Badonyi and Marsh8 as described in the STAR Methods section. Based on the overall performance, we selected the GCN model for functional effect prediction.Table 2. Functional effect prediction performance on the test setMetricGCNGATGINSVM8F10.6270.5900.6000.593Precision0.6050.5170.5490.669Recall0.6590.7120.6760.535GCN, graph convolutional network; GAT, graph attention network; GIN, graph isomorphism network; SVM, support vector machine.
Model interpretation
MOI feature attribution
Using the GAT model, we calculated features attribution separately for correctly predicted AD or AR proteins in the test set. We observed that the most important predictors for AD prediction are features related to constraint and conservation (Figure 3, left). The top feature was UNEECON (unified inference of variant effects and gene constraints), which measures the evolutionary pressure.21 Using the labeled data, we observed that AD proteins have higher UNEECON values compared to AR proteins (Figure 3, right).Figure 3GAT model interpretation for AD proteinsLeft panel shows the relative feature importance values for AD prediction. Right panel displays the distribution of the top feature (UNEECON score) for AD and AR proteins.
For AR prediction, the most important feature was pLI (probability of loss-of-function intolerance), which is probability of loss-of-function intolerance22 (Figure 4, left). Using the ground truth dataset, we observed that AR proteins have lower pLI values compared to AD proteins (Figure 4, right).Figure 4GAT model interpretation for AR proteinsLeft panel shows the relative feature importance values for AR prediction. Right panel displays the distribution of the top feature (pLI) for AD and AR proteins.
Functional effect feature attribution
Using the GCN model, we measured features attribution for correctly predicted DN, HI, and GOF proteins. Because features are at residue level and prediction are at protein level, we cannot draw direct conclusions from these measurements, yet they can help to understand the associations.
For DN proteins, the most important feature was the RNA-binding score based on DRNApred23 (Figure 5, left). Using the labeled data, we observed that residues in DN proteins have higher RNA-binding scores compared to HI and GOF proteins (Figure S1).Figure 5GCN model interpretationRelative feature attribution values for DN (left), HI (middle), and GOF (right) predictions.
For HI proteins, as shown in Figure 5 (middle), the topological domain is the strongest predictor. This feature was derived from UniProt.24 We observed that HI proteins have a lower fraction of topological domains compared to DN and GOF proteins (Figure S2).
Feature attribution analysis for GOF proteins showed that the top feature is the helix structure (Figure 5, right), derived from UniProt.24 The distribution of helical fractions indicates that GOF proteins have a relatively higher fraction of helical structures compared to HI and DN proteins (Figure S3).
Proteome-wide inference
MOI prediction for all autosomal proteins
Of the 17,248 nodes in the PPI network, 16,477 (96%) were autosomal, and we used the GAT model to predict the most likely MOI for all of them. A total of 8,869 (54%) were predicted to be AR, 6,277 (38%) were predicted to be AD, and 1,206 (7%) were predicted to be ADAR (autosomal dominant and autosomal recessive) (Figure S4). As expected, we observed a strong negative correlation between the probability of being AD and AR (Pearson correlation coefficient = −0.95) (Figure S5).
Finally, we performed pathway enrichment analyses for AD and AR proteins separately. AD proteins were significantly enriched in pathways associated with gene regulation (Figure 6 left), while AR proteins were significantly overrepresented in mitochondrial pathways (Figure 6, middle). Using the ground truth dataset, we observed that AR proteins are more likely to be localized inside mitochondria compared to AD proteins ( ) (Figure 6, right).Figure 6. Pathway analysis for AD and AR proteinsLeft and middle panels show top significantly enriched pathways for AD and AR proteins, respectively. Right panel shows the number of proteins associated with subcellular localization inside or outside mitochondria. The odds ratio was calculated as . p value was calculated using the Fisher’s exact test.
Functional effect prediction for all AD-predicted proteins
Based on the proteome-wide MOI predictions, we identified 7,483 AD or ADAR proteins and predicted their functional effect using the GCN model. Among them, 2,043 (28%) were classified as only DN, 1,097 (15%) as only HI, and 415 (6%) as only GOF. Additionally, 1,843 (26%) were both DN and HI, 1,569 (22%) were both DN and GOF, 181 (3%) were both HI and GOF, and 35 (1%) were classified as DN, HI, and GOF (Figure S6). We also provide the counts based on AD-only and ADAR-only proteins in Figures S7 and S8, respectively.
Pathway enrichment analysis revealed that DN proteins are enriched in pathways associated with filament organization (Figure 7, left), HI proteins are overrepresented in pathways related to transcription regulation (Figure 7, middle), and GOF proteins are enriched in pathways related to ion transport across membranes (Figure 7, right).Figure 7. Pathway analysis for DN, HI, and GOF proteinsThe panels show the top significantly enriched pathways for proteins associated with DN (left), HI (middle), and GOF (right) mechanisms.
Discussion
In this work, we introduce a novel framework that integrates GNNs with structural interactomics to predict both the MOI and the functional effects of mutated proteins in genetic disorders. By leveraging PPI network and high-resolution protein structures, we offer a graph-of-graphs approach that addresses two critical aspects of genetic disease prediction. This allows us to not only classify proteins as AD or AR but also predict whether AD diseases manifest through HI, GOF, or DN mechanisms. Our framework demonstrated good performance in predicting MOI, with the GAT model achieving the best score for identifying AD and AR proteins. In terms of functional effects, the GCN model effectively classified HI, GOF, and DN proteins based on structural features.
The most important feature in predicting AD proteins is the evolutionary pressure score (UNEECON).21 This aligns with previous studies showing that AD proteins experience stronger negative selection than AR proteins.25^,^26 Additionally, we observed a strong enrichment of AD proteins in pathways related to gene expression regulation. Prior research has shown that transcription factors (TFs) are often dosage sensitive, particularly haploinsufficient, leading to dominant disease phenotypes.27^,^28 This is consistent with the fact that many human birth defects and neurodevelopmental disorders are caused by mutations in a single copy of TFs and chromatin regulator genes.29
For AR proteins, we found that the pLI (probability of loss-of-function intolerance) index is the most important feature. The pLI index estimates the likelihood that knocking out one copy of a gene will result in a phenotype.22 Low-pLI genes typically exhibit functional redundancy or possess sufficient reserve capacity, allowing heterozygous carriers to remain asymptomatic.30 We also observed that AR proteins are enriched in mitochondrial pathways, consistent with previous findings that the vast majority of nuclear-encoded mitochondrial disease genes follow a recessive inheritance pattern.31 This bias toward recessive inheritance likely arises because defects in energy metabolism generally become pathogenic only when both alleles are disrupted. As long as one allele remains functional, mitochondrial pathways can sustain baseline energy production, preventing deleterious consequences.32^,^33
Feature attribution analysis revealed that DN proteins are strongly associated with high RNA-binding scores, consistent with the previous observation that DN mutations are enriched in nucleic acid-binding pathways.8 One possible explanation is that DN mutations often occur at critical interaction sites, such as DNA/RNA-binding interfaces, where they allow the mutant protein to retain its ability to bind partners but disrupt the overall function of the interacting complex. Additionally, we observed an enrichment of DN proteins in pathways related to filament organization. This is likely due to the inherent susceptibility of filamentous and polymeric assemblies to “poisoning” by mutant subunits, which can incorporate into multimers and destabilize the entire structure. A well-documented example is keratin-related disorders, where keratins (type I and II intermediate filament proteins) form an essential cytoskeletal network in epithelial cells. Mutations in keratin genes lead to cell fragility and are inherited in an AD manner, with the mutations exerting their effect through a DN mechanism.34
A depletion in topological domain emerged as the most important feature in predicting HI proteins. According to UniProt, the topological domain annotation defines the subcellular compartment in which each non-membrane region of a membrane-spanning protein is located. Our findings indicate that DN and GOF proteins have a higher fraction of topological domain annotations compared to HI proteins, suggesting that HI genes encode fewer membrane-spanning proteins than DN and GOF genes. This distinction may reflect fundamental differences in the functional roles of these proteins and their sensitivity to dosage effects. Furthermore, HI proteins are significantly enriched in pathways related to transcriptional regulation, consistent with previous findings that TFs are frequently dosage sensitive and particularly prone to HI.27^,^28
Finally, the most important feature for predicting GOF proteins is helix structure. This finding is consistent with previous reports showing that GOF variants are significantly more likely to occur in alpha helices.6 We also observed that GOF genes are enriched in pathways related to ion transport across membranes, further supporting the structural-functional link between helices and membrane proteins. A notable example is epilepsy-associated genes: approximately 25% of them encode ion channels, many of which causing epilepsy through a GOF mechanism.35 This association is biologically plausible, as many membrane proteins have a core architecture of transmembrane helices, which are critical for gating and transport functions.36^,^37
Moving forward, there are several avenues for expanding this work. Incorporating tissue-specific PPI networks and expression data could improve the precision of our predictions, especially for proteins with context-dependent functions.38 Additionally, expanding the model to account for more complex inheritance patterns, such as polygenic traits and epistasis, could provide a more comprehensive understanding of genetic disease.39 Moreover, improving the interpretability of models in biological contexts remains essential to derive more actionable insights from the predictions.40 Finally, integrating these predictions with other computational tools and databases could further enhance our understanding of genetic diseases by providing a more holistic view of their underlying mechanisms.41^,^42^,^43
Limitations of the study
Although our approach provides a comprehensive view of inheritance patterns and functional effects, it has several limitations. First, the availability of high-quality structural data for all human proteins remains limited, potentially affecting prediction accuracy.44 To ensure uniform coverage, we relied on AlphaFold-predicted structures, which offer high accuracy for well-folded domains but have notable drawbacks. Unlike experimentally resolved structures, AlphaFold does not capture conformational flexibility, ligand interactions, or post-translational modifications, all of which are critical for functional interpretation. Additionally, it is less reliable for intrinsically disordered regions and dynamic protein states, where structural plasticity plays a key role.45 Beyond structural considerations, our reliance on existing PPI network data introduces potential biases, as interaction coverage varies across tissues and biological contexts.38 Furthermore, class imbalance in labeled training data may affect model performance, particularly for underrepresented functional categories. Finally, while our method effectively predicts the functional effects of AD proteins, it does not extend to other inheritance patterns or interactions influenced by polygenic or epistatic effects.46
Resource availability
Lead contact
Requests for further information and resources should be directed to and will be fulfilled by the lead contact, Jacques Fellay ([email protected]).
Materials availability
This study did not generate new materials.
Data and code availability
- •Data: https://zenodo.org/records/13969533.
- •Code: https://github.com/AliSaadatV/Structural-Interactomics.
- •Other items: any additional information required to reanalyze the data reported in this article is available from the lead contact upon request.
Acknowledgments
This work was funded by the 10.13039/501100001711Swiss National Science Foundation via grant #197721 and by the Swiss 10.13039/501100007352State Secretariat for Education, Research and Innovation via contribution to project “UNDINE”, SBFI no. 23.00322.
Author contributions
Conceptualization, A.S.; methodology, A.S.; investigation, A.S. and J.F.; writing – original draft, A.S.; writing – review and editing, A.S. and J.F.; funding acquisition, J.F.; supervision, J.F.
Declaration of interests
The authors declare no competing interests.
Declaration of generative AI and AI-assisted technologies in the writing process
During the preparation of this work, the authors used GPT-4 in order to improve writing and readability.
STAR★Methods
Key resources table
REAGENT or RESOURCESOURCEIDENTIFIERDeposited dataAll pre-processed data essential for reproducing this paper resultsThis paperhttps://zenodo.org/records/13969533Mode of inheritance dataGenCC47https://thegencc.orgMode of inheritance dataOMIM48https://omim.orgMolecular mechanism dataBadony and Marsh8https://osf.io/z4dcp/Protein-protein interaction dataSTRINGdb11https://string-db.orgProtein-protein interaction dataBioGRID12https://thebiogrid.orgProtein-protein interaction dataHuRI13http://www.interactome-atlas.orgProtein-protein interaction dataMenche et al.14https://doi.org/10.1126/science.1257601Predicted protein structuresAlphaFold15https://alphafold.ebi.ac.ukProtein SequencesUniProt24https://www.uniprot.orgSoftware and algorithmsCode essential for reproducing this paper resultsThis paperhttps://github.com/AliSaadatV/Structural-InteractomicsPython 3Python Software Foundationhttps://www.python.orgMMseqs2Steinegger and Soding20https://github.com/soedinglab/MMseqs2
Method details
Data collection
Mode of inheritance
We collected the MOI data from the Gene Curation Coalition (GenCC)47 as well as the Online Mendelian Inheritance in Man (OMIM).48 For GenCC records, we kept records with definitive, strong, or moderate gene-disease clinical validity. We focused on autosomal proteins, due to intrinsic differences in MOI for X chromosome proteins. Proteins were accordingly labeled as AD, AR, or ADAR (both dominant and recessive).
Molecular mechanism
We collected the functional effect of AD proteins from Badonyi and Marsh.8 This is a curated set of AD proteins labeled with their known functional effects, including DN, GOF, and HI.
PPI network
To make a comprehensive PPI network, we combined the interaction from four resources: STRINGdb with interaction score 0.7,11 BioGRID,12 the Human Reference Interactome (HuRI),13 and Menche et al.,14 which resulted in a network with 17,248 nodes, and 375,494 edges.
Protein graph
We downloaded the predicted structures of all human proteins from the AlphaFold database.15 We then used Graphein16 to construct a residue graph for each protein based on its structure. In these residue graphs, nodes represent amino acids, and edges capture various interactions between them, including peptide bonds, aromatic interactions, hydrogen bonds, disulfide bonds, ionic interactions, aromatic-sulfur interactions, and cation-π interactions. To account for long-range amino acid interactions, we included edges between amino acids that are spatially close (<5 angstroms) but distant in the sequence (>5 amino acids apart).
Protein features
We annotated all proteins with 97 features covering various aspects of protein characteristics, including structure and function, conservation and constraint, and expression and regulation. Using the training set, we excluded features with low variance (<0.1) or high correlation (>0.8), resulting in a final selection of 78 features. The complete list of initial and selected protein features is provided in Table S1.
Residue features
For the residue graphs, we annotated the nodes (i.e. amino acids) with 132 features covering various aspects including structure and function, sequence, biochemical, and evolutionary charecteristics. Using the training set, we excluded features with low variance (<0.01) or high correlation (>0.8), resulting in a final selection of 73 features. The complete list of initial and selected residue features is provided in Table S2.
Model development
Study design
In this study, MOI is predicted by classifying nodes in a PPI network, while functional effect prediction is framed as a graph classification task. Both models use a multi-label classification approach, allowing inputs to have more than one label. All experiments were conducted using the PyTorch Geometric library.49
We leverage GNNs because they can directly utilize the relational structure of the data by propagating information through the graph,50 whereas traditional machine learning models require explicit feature engineering to incorporate graph-based information. For MOI prediction, PPI networks encode valuable topological properties that influence disease mechanisms. GNNs capture both individual protein features and their connectivity within the network, an aspect that conventional ML models struggle to integrate effectively. Similarly, for functional effect prediction, protein structure graphs provide spatial and biochemical context at the residue level. While traditional ML approaches require manually extracting graph-based features (e.g., graph centrality, residue connectivity), GNNs inherently integrate these properties, enabling a more comprehensive representation of structural and functional information.
Architecture
For both MOI and functional effect prediction, we utilized various graph neural network architecture including graph convolutional network (GCN),17 graph attention network (GAT),18 and graph isomorphism network (GIN).19
GCNs extend the concept of convolution from grid-like data (such as images) to graph data, allowing the aggregation of feature information from neighboring nodes. This approach effectively captures local graph structure and node features. The forward propagation formula in a GCN is given by:
- (1) : The node feature vector at layer l.
- (2) : The updated node feature vector at layer .
- (3) : The learnable weight matrix for layer l.
- (4) : The set of neighbors of node i (including itself due to the self-loop).
- (5) : The normalization term based on the degrees of nodes i and j, ensuring that nodes with different degrees contribute proportionally to the update.
GINs are designed to be powerful for graph isomorphism, making them capable of distinguishing a wide variety of graph structures. They achieve this by using a multi-layer perceptron (MLP) to aggregate node features, enhancing their discriminative power. The update rule for the GIN is given by:
- (1) : The node feature vector at layer l.
- (2) : The updated node feature vector at layer .
- (3) : A multi-layer perceptron applied at layer l, which acts as a learnable transformation function on the aggregated node features.
- (4) : A learnable parameter at layer l that adjusts the contribution of the central node’s own features .
- (5) : The set of neighbors of node i. The sum aggregates the features of all neighbor nodes in layer l.
GATs introduce attention mechanisms to GNNs, enabling nodes to assign different importance weights to their neighbors. This allows for more flexible and expressive feature aggregation, potentially improving performance on tasks where certain neighbors have more influence than others. The forward propagation rule for GAT is given by:
- (1) : The node feature vector at layer l.
- (2) : The updated node feature vector at layer 1.
- (3) : The attention coefficient between nodes i and j.
- (4) : The weight matrix at layer l.
- (5)a: The learnable attention vector.
- (6) : The concatenation operator.
- (7) : The set of neighbors of node i.
- (8) : A non-linear activation function (ReLU in our implementation).
Hyperparameters
In all models, we employed a single hidden layer, with the output layer comprising two units for MOI prediction (AD and AR) and three units for functional effect prediction (DN, HI, and GOF). To identify configurations yielding good performance, for each model we evaluated 25 combinations of hyperparameters, varying the hidden layer size across five values (128, 64, 32, 16, and 8) and the learning rate across five values (ranging from to ). The results of hyperparameter tuning for MOI and functional effect prediction are available in Tables S3 and S4, respectively.
Other hyperparameters were set to commonly used values, including a weight decay of and a dropout rate of 0.3. For optimization, we employed the Adam optimizer51 with an adaptive learning rate scheduler (ReduceLROnPlateau), which dynamically adjusted the learning rate based on validation loss.
Training and evaluation
To create train, validation, and test splits, we clustered protein sequences using MMseqs220 with thresholds of 20% coverage and 20% sequence identity. For each model, the proteins were divided into 80% training, 10% validation, and 10% testing sets, ensuring minimal data leakage by performing the splits at the cluster level.
We trained each model using a binary cross entropy loss for maximum 100 epochs and used early stopping based on validation loss to avoid over-fitting. We evaluated each selected model on the unseen test data using , precision, and recall scores.
We benchmarked the performance of our model against previous state-of-the-art approaches. For MOI prediction, we compared our model with DOMINO,4 which predicts the probability of a protein’s association with dominant disorders (pAD). We used our MOI test set and excluded any proteins present in DOMINO’s training data. Since no threshold was provided, we classified proteins as AD if pAD 0.6, AR if pAD 0.4, and ADAR otherwise.
For functional effect prediction, we compared our model with those from Badonyi and Marsh,8 which include three separate SVM classifiers: DN vs. LOF, GOF vs. LOF, and LOF vs. non-LOF. To evaluate performance in a multi-label classification setting, we combined the test sets from these models and utilized the pre-calculated probabilities. We did not benchmark against other traditional ML models, as Badonyi and Marsh8 had already compared multiple approaches including SVM, LightGBM, Random Forest, Logistic Regression, and MLP, and identified SVM as the best performer. Based on these findings, we focused our comparisons on the strongest traditional baseline.
Model explanation
To study the importance of features, we utilized Integrated Gradients52 using Captum.53 Since this method works per sample, we applied it on correctly predicted samples in the test sets. We included samples with only one label for further interpretability. Finally, we averaged feature attributions across selected samples, and scaled them by dividing to the maximum attribution.
Proteome-wide inference
MOI and molecular mechanism inference
After selecting the final models for MOI and functional effect prediction, we predicted the MOI for all proteins in the PPI network (Table S5). Afterwards, we predicted the functional effect for the subset of proteins that were predicted as AD or ADAR (Table S6).
Enrichment analysis
To study further the predictions, we used GSEApy54 to perform enrichment analysis,55 which is a statistical method used to determine whether known biological functions or processes are over-represented in a protein list of interest (e.g. AD proteins). In this method, the enrichment significance is calculated based on the hypergeometric distribution, where p-value is the cumulative probability of observing at least k proteins of interest annotated to a specific protein set. The formula for the p-value is given by:
where N is the total number of proteins in the background distribution, M is the number of proteins in that distribution annotated to the gene set of interest, n is the size of the list of proteins of interest, and k is the number of proteins in that list which are annotated to the gene set. We focused on pathways containing at least 10 and at most 500 genes to exclude pathways that are either too specific or too general. As a reference database, we used Gene Ontology (Biological Processes)56^,^57 to understand functional landscape of proteins.
Quantification and statistical analysis
All analyses were performed using Python, with libraries including NumPy,58 pandas,59 scikit-learn,60 NetworkX,61 PyTorch,62 Captum,53 Graphein,16 and GSEApy.54 Model performance on test sets was evaluated using standard metrics, including score, precision, and recall. For pathway enrichment analysis, p-values were adjusted using the Benjamini–Hochberg method,63 with a significance threshold set at 0.05. Exact sample sizes are provided in the results section and detailed statistical methods and parameters are described in the relevant methods sections.
The reference list from the paper itself. Each links out to its DOI / PubMed record.
- 1Zschocke J.Byers P.H.Wilkie A.O.M.Mendelian inheritance revisited: dominance and recessiveness in medical genetics Nat. Rev. Genet.2420234424633680620610.1038/s 41576-023-00574-0 · doi ↗ · pubmed ↗
- 2Veitia R.A.Exploring the etiology of haploinsufficiency Bioessays 2420021751841183528210.1002/bies.10023 · doi ↗ · pubmed ↗
- 3Backwell L.Marsh J.A.Diverse molecular mechanisms underlying pathogenic protein mutations: Beyond the loss-of-function paradigm Annu. Rev. Genom. Hum. Genet.23202247549810.1146/annurev-genom-111221-10320835395171 · doi ↗ · pubmed ↗
- 4Quinodoz M.Royer-Bertrand B.Cisarova K.Di Gioia S.A.Superti-Furga A.Rivolta C.DOMINO: Using machine learning to predict genes associated with dominant disorders Am. J. Hum. Genet.10120176236292898549610.1016/j.ajhg.2017.09.001PMC 5630195 · doi ↗ · pubmed ↗
- 5Petrazzini B.O.Balick D.J.Forrest I.S.Cho J.Rocheleau G.Jordan D.M.Do R.Ensemble and consensus approaches to prediction of recessive inheritance for missense variants in human disease Cell Rep. Methods 420241009143965768110.1016/j.crmeth.2024.100914 PMC 11704621 · doi ↗ · pubmed ↗
- 6Stein D.Kars M.E.Wu Y.BayrakÇ.S.Stenson P.D.Cooper D.N.Schlessinger A.Itan Y.Itan Y.Genome-wide prediction of pathogenic gain- and loss-of-function variants from ensemble learning of a diverse feature set Genome Med.1520231033803715510.1186/s 13073-023-01261-9PMC 10688473 · doi ↗ · pubmed ↗
- 7Gerasimavicius L.Livesey B.J.Marsh J.A.Loss-of-function, gain-of-function and dominant-negative mutations have profoundly different effects on protein structure Nat. Commun.13202238953579415310.1038/s 41467-022-31686-6PMC 9259657 · doi ↗ · pubmed ↗
- 8Badonyi M.Marsh J.A.Proteome-scale prediction of molecular mechanisms underlying dominant genetic diseases P Lo S One 192024 e 030731210.1371/journal.pone.0307312 PMC 1134102439172982 · doi ↗ · pubmed ↗
