Missense variants pathogenicity annotation from homologous proteins

Gabriel Ruiz-Alías; Sergi Soldevila; Xavier Altafaj; Arnau Cordomí; Mireia Olivella

PMC · DOI:10.1093/bioinformatics/btaf305·May 14, 2025

Missense variants pathogenicity annotation from homologous proteins

Gabriel Ruiz-Alías, Sergi Soldevila, Xavier Altafaj, Arnau Cordomí, Mireia Olivella

PDF

Open Access

TL;DR

This paper introduces HomolVar, a tool that predicts the harmfulness of genetic variants by analyzing similar variants in related proteins, improving diagnosis of genetic disorders.

Contribution

The novel contribution is using homologous variants to predict pathogenicity with high accuracy, which current tools do not fully capture.

Findings

01

Pathogenicity can be extrapolated with 95% accuracy within protein families.

02

Current mutation prediction tools do not fully capture the biological feature of homologous variant pathogenicity.

03

HomolVar achieves up to 98% accuracy for closer homologs.

Abstract

High-throughput DNA sequencing has revealed millions of single nucleotide variants (SNVs) in the human genome, with a small fraction linked to disease. The effect of missense variants, which alter the protein sequence, is particularly challenging to interpret due to the scarcity of clinical annotations and experimental information. While using conservation and structural information, current prediction tools still struggle to predict variant pathogenicity. In this study, we explored the pathogenicity of homologous missense variants—variants in equivalent positions across homologous proteins—focusing on proteins involved in autosomal dominant diseases. Our analysis of 2976 pathogenic and 17 555 non-pathogenic homologous variants demonstrated that pathogenicity can be extrapolated with 95% accuracy within a family, or up to 98% for closer homologs. Remarkably, the evaluation of 27…

Linked entities

Genes, proteins, chemicals, diseases, species, mutations and cell lines named across the full text — each resolved to its canonical identifier and authoritative record.

Species1

Homo sapiens(human · species)

Diseases1

autosomal dominant diseases

Figures2

Click any figure to enlarge with its caption.

Top-left: structural superimposition of SCN4A [in dark blue; PDB ID: 6AGF (Pan et al. 2018)] and SCN5A [in purple; PDB ID: 6LQA (Li et al. 2021)], showing Asn440 and Asn406 as orange spheres. Bottom-left: sequence alignment of the Ion Channel domain for SCN4A and SCN5A. Right: Variants p.Asn440Lys in SCN4A and p.Asn410Lys in SCN5A (as orange sticks) are pointing toward the center of the pore channel.

Overview of the HomolVar interface and functionality. (top-left) Input panel where users can specify a protein and optionally a specific variant. (top-right and bottom) mosaic of output panels consisting of prediction, the list of mapped homologous pairs and an interactive plot displaying the variants already annotated for the protein and the homologous variants.

Tables2

Table 1.. Comparison of genetic variants pathogenicity between pairs of homologous variants.a

		N-N (TN)	P-P (TP)	N-P (FN+FP)	Total pairs	Sensitivity	Specificity	Accuracy	MCC
Strict pairs	All	1612	870	127	2609	0.93	0.96	0.95	0.89
Strict pairs	>30% SI	632	823	30	1485	0.98	0.98	0.98	0.96
Similar mutated aa	All	2164	1433	178	3775	0.94	0.96	0.95	0.90
Similar mutated aa	>30% SI	909	1376	60	2345	0.98	0.97	0.97	0.95
Similar reference aa	All	1650	872	150	2672	0.92	0.96	0.94	0.88
Similar reference aa	>30% SI	556	824	33	1413	0.98	0.97	0.98	0.95
Similar reference and mutated aa	All	3074	1444	266	4784	0.92	0.96	0.94	0.87
Similar reference and mutated aa	>30%SI	1083	1381	89	2553	0.97	0.96	0.97	0.93

Table 2.. Predictive power of some commonly used mutation compared to HomolVar.

Mutation predictor method	Predictive power
HomolVar	0.95
gMVP	0.79
PrimateAI	0.79
AlphaMissense	0.76
SIFT	0.58
FathMM	0.53
PolyPhen-2	0.50
PhyloP17way_primate	0.43
CADD	0.38

Funding2

—Instituto de Salud Carlos III10.13039/501100004587
—European Union10.13039/501100000780

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsGenomics and Rare Diseases · Genetic Associations and Epidemiology · Cancer Genomics and Diagnostics

Full text

1 Introduction

Over the last few years, high-throughput DNA sequencing technologies have allowed the retrieval of hundreds of thousands of human genome and exome sequences, shedding light on individual genetic variation. A typical individual differs at 4.1–5.0 million sites compared to the reference human genome (Auton et al. 2015). Single nucleotide variations (SNV) and short indels constitute 99.9% of these variations (Gudmundsson et al. 2022). While most SNVs occur in non-coding regions or do not alter the amino acid sequence, missense SNVs modify the protein sequence. Consequently, missense variants can affect protein structure and function, eventually leading to disease.

Of the 4 million missense variants identified in the human genome, only 2% have a clinical annotation (pathogenic or non-pathogenic) (Cheng et al. 2023). In an individual’s exome, there are, on average, 200 very rare missense variants, for which it is unlikely to have clinical annotations. This includes 14–40 novel missense variants absent in other human exomes (Gudmundsson et al. 2022). Discerning the variants responsible for a specific condition amidst the benign variation is extremely challenging, limiting the diagnosis and therapeutic intervention in rare diseases.

For missense variants in proteins involved in autosomal dominant diseases, there is a clear correlation between the effect of a single variant in the protein structure, the protein function, and the clinical phenotype (pathogenic or non-pathogenic). Thus, numerous efforts are being made to detect the consequences of missense variants by assessing their potential to modify the structure and function of the protein. These predictions are mainly based on the phylogenetic conservation of the region containing the variant and how changes may affect protein structure and function. The first popular algorithms in this regard were SIFT (Vaser et al. 2016) and PROVEAN (Choi and Chan, 2015), which have been standard bearers in predicting the potential impact of variants. Other tools, such as PolyPhen-2 (Adzhubei et al. 2013), incorporate structural parameters for variant classification. M-CAP (Jagadeesh et al. 2016) and, more recently, AlphaMissense (Tordai et al. 2024) added machine learning-based approaches. Despite these tools having improved our knowledge about variant pathogenicity, they still have many limitations, highlighting the hard nature of this task.

In the context of GRIN-related disorders, we have recently demonstrated that homologous missense variants for proteins encoded by GRIN genes have identical clinical annotations (Santos-Gómez et al. 2022). Homologous variant extrapolation within this family of proteins, involved in a rare Mendelian disease, allowed duplicating the number of functionally and clinically annotated variants.

In the present study, we scaled up our previous analysis to all human proteins involved in autosomal dominant diseases. We show that in 95%–98% of the homologous pairs of variants that we could identify, they coincided in their annotations. Therefore, our findings demonstrate that pathogenicity can be safely extrapolated between homologous variants. Our results represent an important expansion of genetic variants’ annotation repertoire, providing a direct clinical impact on genotype-phenotype assessment in patients with rare genetic disorders.

2 Materials and methods

2.1 Build-up of pathogenic and non-pathogenic missense variant datasets

We took all reviewed human proteins in the UniProt (The UniProt Consortium 2023) as of 06 April 2023 (20 422 proteins) and exclusively kept those encoded by genes with an autosomal dominant (AD) inheritance pattern, according to OMIM (Amberger et al. 2019). From these, we selected only pathogenic proteins, defined as proteins with at least three reported pathogenic missense variants in ClinVar (Landrum et al. 2020). The final set resulted in 1282 pathogenic proteins (see Supplementary Table S1).

To construct the pathogenic missense variants dataset, all non-somatic missense disease-causing/pathogenic and likely pathogenic variants from ClinVar (06 April 2023) were collected for each protein. To create the non-pathogenic missense variants dataset, we retrieved all missense variants from gnomAD v4 (Chen et al. 2024) for the same group of proteins. To balance the number of pathogenic and non-pathogenic variants and to minimize the possibility of taking pathogenic rare variants yet to be identified, we generated a collection of datasets discarding variants with allele frequencies up to various thresholds (see Supplementary Table S2). Based on these results, the final analysis was performed by discarding gnomAD variants with an allele frequency below 10^−6^. In addition, we discarded variants with unclear consequences and those labelled as pathogenic in ClinVar and present also in gnomAD (assumed non-pathogenic). The latter are probably neutral variants that were wrongly identified as disease-causing in patients with a disorder due to the lack of these variants in the healthy population. Still, we opted to discard them provided that the non-pathogenic set was large enough. The obtained datasets contained 28 888 pathogenic missense variants (see Supplementary Table S3) and 310 268 non-pathogenic missense variants (see Supplementary Table S4), all involving pathogenic proteins (see above).

2.2 Identification of homologous variants in pathogenic and non-pathogenic missense variant datasets

We used Pfam (Paysan-Lafosse et al. 2023) family multiple sequence alignments to identify equivalent positions between homologous proteins to those present in our previously described datasets of missense variants in pathogenic proteins. In particular, manually curated seed alignments (Sonnhammer et al. 1998) were selected instead of non-manually curated full alignments. We considered two different criteria for homologous variants: (i) those that imply the same amino acid change at the same equivalent position (strict pairs), or (ii) those involving similar amino acid change (positive score in the BLOSUM62 substitution matrix, see Supplementary Table S5) at the same equivalent position (similar pairs). The BLOSUM62 was chosen as the average sequence similarity between the proteins containing homologous variants was around 60%. To discard the possible influence of the choice of the substitution matrix, we repeated the analysis using BLOSUM45 and BLOSUM80 matrices as well.

2.3 Web server

The HomolVar web application was constructed using a Python backend (v.3.10.12) with the Flask framework (v.3.0.3). The application was deployed using Apache v.2.4.52.

2.4 Comparison of HomolVar to other mutation predictors

For comparison purposes, the pathogenicity of all pairs of homologous variants was predicted using dbNSFP (Liu et al. 2016), which includes SIFT (Vaser et al. 2016), PolyPhen-2 (Adzhubei et al. 2013), AlphaMissense (Tordai et al. 2024), MutationTaster (Schwarz et al. 2014), MutationAssessor (Reva et al. 2011), PROVEAN (Choi and Chan, 2015), VEST (Carter et al. 2013), M-CAP (Jagadeesh et al. 2016), REVEL (Ioannidis et al. 2016), MVP (Qi et al. 2018), gMVP (Zhang et al. 2022), PrimateAI (Sundaram et al. 2018), deogen2 (Raimondi et al. 2017), ClinPred (Alirezaie et al. 2018), LIST-S2 (Malhis et al. 2020), ESM1b (Brandes et al. 2023), MutScore (Quinodoz et al. 2022), CADD (Kircher et al. 2014), DANN (Quang et al. 2015), fathmm-XF (Rogers et al. 2018), BayesDel (Feng, 2017), MetaSVM (Kim et al. 2017), MetaLR (Chen et al. 2023), MetaRNN (Li et al. 2022), VARITY (Wu et al. 2021), phyloP (Pollard et al. 2010), and phastCons (Siepel et al. 2005).

3 Results

3.1 Comparing pathogenicity on pairs of homologous variants

In this study, we explored the hypothesis that similar variants affecting equivalent positions of homologous proteins, the so-called “homologous variants”, result in conserved pathogenicity. We focused the analysis on “pathogenic proteins,” defined as proteins linked to autosomal dominant diseases. In these proteins, there is a strong correlation between the effect of a single missense variant in the protein structure and function, as well as with the clinical outcome (pathogenic or non-pathogenic). Thus, this group excludes proteins fulfilling one or more of the following criteria: (i) not primarily related to human diseases, (ii) encoded by genes with recessive inheritance, and/or (iii) linked to complex diseases, a scenario that would mitigate the interpretation of the functional effect of the genetic variant.

Methodologically, we used family alignments to map equivalent positions within homologous proteins and publicly available missense variant data [such as ClinVar (Landrum et al. 2020) and gnomAD (Chen et al. 2024)] to test our hypothesis. Due to sequence conservation heterogeneity, only those variants located within conserved regions (i.e. available sequence alignments with the corresponding homologous proteins) were analyzed. This resulted in a collection of 2976 pathogenic and 17 555 non-pathogenic variants distributed along 1282 proteins and 292 Pfam family domains (see Methods) (see Supplementary Table S6).

We first analyzed variants in equivalent positions with the same amino acid change (named “strict pairs,” Table 1). We found 2482 (95%) coincident pairs of homologous variants, that is, presenting the same annotation [1612 with both variants annotated as non-pathogenic (N-N), and 870 variants annotated as pathogenic (P-P)]. Only 127 homologous pairs presented discordant annotations (N-P; one non-pathogenic and one pathogenic) (see Table 1 and Supplementary Table S7). The computed classification evaluation metrics [sensitivity, specificity, accuracy, and Matthews correlation coefficient (MCC)] showed all high values, near 1. These data strongly supported the hypothesis that the pathogenicity of variants lacking annotations can be inferred from annotated variants in homologous positions with high predictive power. Moreover, as the quality of multiple sequence alignments increases, the number of annotations that can be extrapolated between homologous variants will also increase. In terms of specificity, the results are in agreement with a preprint manuscript that used a similar approach on paralogous variants using data exclusively from ClinVar (Li et al. 2023). In that work, however, they obtained a very low sensitivity, probably due to the lack of a large dataset of non-pathogenic variants –such as the one employed in the present study-, and an appropriate distinction between variants in pathogenic proteins from those on non-pathogenic proteins.

It is reasonable to speculate that discrepancies in non-coincident pairs may be due to distantly related homologous proteins or to the misalignment of homologous sequences that also occur at lower sequence identities. To verify this hypothesis, we discarded those protein pairs with <30% sequence identity in their aligned domains. The subsequent analysis revealed that the number of homologous variant pairs with identical pathogenicity increased to 98% (Table 1), although with the penalty of decreasing the number of pairs. These results showed that high protein conservation is directly linked with the reliability of annotation extrapolation. Moreover, these results suggest that developing robust sequence alignments within a protein family curated by experts and eventually incorporating structural information would improve the performance of pathogenesis extrapolation.

Despite the high number of missense variants identified in the human genome, those with an associated clinical annotation are still scarce. Therefore, to increase the number of annotations through homologous variant extrapolation, we checked whether variants could also be extrapolated, not only for the same variation but also for similar ones, that is, amino acids with similar physicochemical properties. We compared pairs of homologous variants presenting a similar reference and/or a similar mutated amino acid according to BLOSUM62 scores, consistent with an average sequence similarity between protein pairs of 60%. Two amino acids are considered similar when their substitution score in the matrix is >0. For instance, Leu would be considered similar to Ile (score = 2), Met (score = 2) and Val (score = 1) (see Supplementary Table S5).

The results showed that by incorporating similar mutated amino acid changes, the dataset increased while not affecting the classification evaluation metrics (see Table 1 and Supplementary Table S8). In contrast, by incorporating similar reference amino acids, the number of pairs remained almost the same as for strict pairs, and the statistical descriptors were slightly reduced (see Table 1 and Supplementary Table S9). This might be explained by the fact that the effect of a missense variant is more dependent on the physicochemical properties of the mutated amino acid rather than the reference amino acid. The advantage of incorporating similar reference and mutated amino acids is that it almost duplicates the dataset size without compromising the comparison evaluation metrics (see Table 1 and Supplementary Table S10). We repeated the same analysis using BLOSUM45 and BLOSUM80 substitution matrices to evaluate the possible effect of the substitution matrix on evaluating similarity. These matrices are appropriate for proteins with lower or higher sequence identity, respectively. The results confirmed that the choice of matrix did not notably impact the outcomes (see Supplementary Table S11).

We asked ourselves if domains could also be classified as pathogenic and non-pathogenic or if their pathogenesis was related to the pathogenicity of the protein. It is estimated that 70% of all human proteins present more than one domain (Sonnhammer et al. 1998). For all domains containing disease-causing variants, we assessed if the same domain was also present in non-pathogenic proteins. Supplementary Table S12 shows that these domains are either present in pathogenic and non-pathogenic proteins, suggesting that the pathogenicity is not associated with specific domains and that it is associated with specific proteins. Thus, if a variant affects the function of a domain, this variant will be disease-causing if the domain is contained in pathogenic proteins (proteins involved in autosomal dominant inheritance disorders) but will be neutral if the domain is contained in proteins not involved in any disease, involved only in complex diseases or follow an X-linked or recessive inheritance.

3.2 Homologous variant example: Ion transport family domain

To illustrate the basis of the pathogenicity extrapolation between homologous variants, we thoroughly characterized discrete missense mutations affecting the Ion Transport protein family domain (PF00520). Although this domain is present in 102 human proteins, curated multiple sequence alignments (seed Pfams) are only available for 14 pathogenic and 9 non-pathogenic proteins. For one of its members, SCN4A, variant p. Asn440Lys has been associated with paramyotonia congenita, a channelopathy altering muscle contraction (Lehmann-Horn et al. 2011, Lossin et al. 2012). The structure of this channel (Fig. 1) shows that residue Asn440 points towards the center of the pore channel. This orientation suggests that the positive charge introduced by mutation p. Asn440Lys would modify channel gating, altering Na+ influx and ultimately affecting membrane potential. Figure 1 displays the sequence alignment between SCN4A-encoded protein and its paralogous SCN5A gene product, sharing 66% sequence identity. The structure superimposition of the two sodium channels is almost fully coincident, with a root mean squared standard deviation of 1.0 Å. The equivalent position of SCN4A(p.Asn440) corresponds to SCN5A(p.Asn406), and is located within a conserved region. Consequently, the residues appear in the same position, at the center of the ion channel pore, and might have similar functional outcomes, namely a disturbance of channel gating properties. In line with the conserved topology and structural-based prediction, SCN5A(p.Asn406Lys) variant has also been reported to be disease-associated, causing a cardiac channelopathy (Tester et al. 2005). The same homologous variant is also found, annotated as pathogenic, in SCN1A, SCN9A, and CAC1A. According to our results, the pathogenicity of these variants can be extrapolated to predict as pathogenic 43 potential variants that mutate to Lys in the same equivalent position in the PF000520 Multiple Sequence Alignment, in the 14 proteins that are associated with autosomal dominant inheritance. Furthermore, if we extend this prediction to variants that result in similar amino acids, an additional 132 variants across these 14 proteins can be predicted as pathogenic. Consequently, one single variant, SCN4A(p.Asn440Lys) allows us to predict the pathogenesis of 14 identical homologous potential variants and, in addition, 132 similar homologous potential variants (see Supplementary Table S13).

Top-left: structural superimposition of SCN4A [in dark blue; PDB ID: 6AGF (Pan et al. 2018)] and SCN5A [in purple; PDB ID: 6LQA (Li et al. 2021)], showing Asn440 and Asn406 as orange spheres. Bottom-left: sequence alignment of the Ion Channel domain for SCN4A and SCN5A. Right: Variants p.Asn440Lys in SCN4A and p.Asn410Lys in SCN5A (as orange sticks) are pointing toward the center of the pore channel.

On the other hand, three identical homologous variants are present in the same position of the alignment for seven additional members of the Transport protein family domain that do not follow an autosomal dominant inheritance (see Supplementary Table S13). These variants are present in gnomAD and are classified as non-disease-causing. Although these variants may affect the structure and/or the function of the proteins, these proteins are associated with autosomal recessive, X-linked inheritance, only affected by copy number variations or not related to any disease. In accordance with our results, pathogenic homologous variants can also be used to identify homologous variants in proteins not related to an autosomal dominant inheritance that are affecting the structure and/or function of the protein, but that are not disease-causing per se and are found in healthy population.

3.3 HomolVar web server

To facilitate the task of encountering homologous variants and their annotations, we developed a web server that computationally predicts the pathogenesis of missense variants based on available pathogenesis annotations of homologous variants. HomolVar aligns the query variant provided by the user to the corresponding multiple sequence alignments to identify homologous variants (either identical or similar) with available annotations. These are used to classify the variants as damaging or not damaging the protein structure and function. If the variant is predicted as damaging the protein structure and function and the protein is involved in a monogenic disease, the variant is predicted as disease-causing. If the protein is not involved in a monogenic disease, then the variant is predicted as non-disease-causing because, even though it may alter protein structure and function, this is not enough to cause pathogenicity. For each query (see Fig. 2 for an example), which may be a specific variant or the whole gene/protein, the output displays the predictions together with all available annotations for homologous variants. A full protein schematic interactive representation of all the annotated variants and homologous variants along the sequence is also displayed.

Overview of the HomolVar interface and functionality. (top-left) Input panel where users can specify a protein and optionally a specific variant. (top-right and bottom) mosaic of output panels consisting of prediction, the list of mapped homologous pairs and an interactive plot displaying the variants already annotated for the protein and the homologous variants.

3.4 Comparison of HomolVar to other mutation predictors

Next, we questioned if this biological feature of homologous variants is worth considering explicitly or if it is already captured by the commonly used algorithms for variant prediction. For this purpose, we compared HomolVar prediction power to 27 mutation predictors for homologous variants (see Table 2, Supplementary Tables S14 and 15). The successful prediction rate for mutation prediction servers ranged between 34% and 79%, approximately 54% on average. The low pathogenicity prediction rates in the mutation prediction tools show that the concept of homology within variants is not considered or properly captured by these mutation prediction servers either explicitly or implicitly. The results claim the need to incorporate homologous annotations. Yet, it should be taken into account that these results are already overestimated due to circularity problems (Grimm et al. 2015). Circularity is due to overlapping variants in the training and evaluation dataset and also to protein overlap when variants from the same protein appear in the training and evaluation dataset. It has been extensively discussed that mutation predictors that are based on phylogenetic trees rather than multiple sequence alignments present a better prediction power (Siepel et al. 2005, Pollard et al. 2010, Dereli et al. 2024). However, this is not observed in our dataset, where methods based on phylogenetic trees do not show a better performance.

Predictive power of some commonly used mutation predictor methods compared to HomolVar (between 0 and 1). The performance is computed as the correct number of variants pathogenicity predictions divided by the total number of variants, for all pairs of identical homologous variants.

3.5 Strengths and limitations of HomolVar

We have previously used homologous annotations to predict the pathogenesis of unannotated variants within specific families of proteins (Santos-Gómez et al. 2022). In the present study, we have extended this assumption to all proteins in the human proteome. Although it is expected that sequence similarity implies similarity in the variant label, an appropriate assessment had, to the best of our knowledge, not been performed before. HomolVar incorporates this biological feature, which is not properly captured by current mutation prediction algorithms.

An important feature we have set in HomolVar is distinguishing between variants located in pathogenic proteins or non-pathogenic proteins. That is, although a variant can affect the function of a protein (damaging variants), if the protein is not involved in an autosomal dominant inheritance disease, then the variant can be found in a healthy population, and the variant is not disease-causing. In general, this assumption is not considered in mutation predictors. Thus, the training sets mix variants in pathogenic and non-pathogenic proteins, decreasing its prediction power. The inclusion of proteins involved in autosomal recessive disorders, X-linked disorders or complex diseases implies having a dataset where there is no direct correlation between the effect of a change in the structure and function of a protein and the corresponding clinical phenotype. Moreover, we propose that mutation predictors should differentiate between predicting that a variant is affecting the structure and/or function of the protein (damaging/not damaging) and additionally, if the variant is responsible for a clinical phenotype (disease-causing/non-disease-causing).

One of the current limitations of HomolVar is the number of available annotated pathogenic missense variants and non-pathogenic missense variants in pathogenic proteins, which reduces its coverage. Another limitation that also reduces the coverage is the lack of high-quality multiple sequence alignments for all positions in all human proteins and all its members. However, due to the expected increase in genome and exome sequencing and to the availability of improved multiple sequence alignments for all human proteins, it is expected that the coverage and the prediction power of HomolVar will increase in future years. HomolVar web server is ready to incorporate both upcoming new variants and user-provided multiple sequence alignments at different levels of protein classification.

4 Conclusions

Our analysis demonstrates that the pathogenicity within homologous variants in pathogenic proteins is coincident within homologous variants with an accuracy of 95%. This applies to both identical homologous variants (same reference and mutated amino acids) and also to similar homologous variants (reference and mutated amino acids with similar physicochemical properties). We also show that the accuracy can reach up to 98% when imposing a minimum threshold of 30% sequence identity within the protein domain.

The present findings can be useful for extrapolating the pathogenicity of annotated variants into non-annotated homologous variants, broadening the number of annotations. Although the hypothesis we have tested is quite simple, we have observed that this information is not being considered in commonly used mutation predictors, which reach moderate prediction rates between 34 and 79%, when considering variations with available annotations in the original protein and homologous ones. Still, these values may be overestimated due to the circularity of variants in the training and test sets.

Considering the scarce number of variants with pathogenicity annotations, these findings are clinically highly relevant, with outstanding importance in the context of patient diagnosis and personalized therapies. Additionally, we expect that the concept of homologous missense variants is valid beyond the context used in the present study. In proteins involved in complex diseases or recessive disorders, the pathogenicity annotations from missense variants in pathogenic proteins could be used to understand their putative aetiological contribution. In the same direction, these extrapolations would be potentially used to annotate somatic mutations or to select a conserved variant in orthologous proteins for the design and generation of transgenic models of human genetic disorders.

We created a web server (HomolVar, available at https://rarevariants.org/HomolVar/) that facilitates this task of exploring homologous variations and their annotations.

Supplementary Material

btaf305_Supplementary_Data

Bibliography46

The reference list from the paper itself. Each links out to its DOI / PubMed record.

1Adzhubei I , Jordan DM, Sunyaev SR. Predicting functional effect of human missense mutations using Poly Phen-2. Curr Protoc Hum Genet 2013;20:1–50.10.1002/0471142905.hg 0720 s 76PMC 448063023315928 · doi ↗ · pubmed ↗
2Alirezaie N , Kernohan KD, Hartley T et al Clin Pred: prediction tool to identify disease-relevant nonsynonymous single-nucleotide variants. Am J Hum Genet 2018;103:474–83.30220433 10.1016/j.ajhg.2018.08.005PMC 6174354 · doi ↗ · pubmed ↗
3Amberger JS , Bocchini CA, Scott AF et al OMIM.org: leveraging knowledge across phenotype-gene relationships. Nucleic Acids Res 2019;47:D 1038–43.30445645 10.1093/nar/gky 1151 PMC 6323937 · doi ↗ · pubmed ↗
4Auton A , Brooks LD, Durbin RM et al A global reference for human genetic variation. Nature 2015;526:68–74.26432245 10.1038/nature 15393 PMC 4750478 · doi ↗ · pubmed ↗
5Brandes N , Goldman G, Wang CH et al Genome-wide prediction of disease variant effects with a deep protein language model. Nat Genet 2023;55:1512–22.37563329 10.1038/s 41588-023-01465-0PMC 10484790 · doi ↗ · pubmed ↗
6Carter H , Douville C, Stenson PD et al Identifying Mendelian disease genes with the variant effect scoring tool. BMC Genomics 2013;14 Suppl 3:S 3.10.1186/1471-2164-14-S 3-S 3PMC 366554923819870 · doi ↗ · pubmed ↗
7Chen S , Francioli LC, Goodrich JK et al A genomic mutational constraint map using variation in 76,156 human genomes. Nature 2024;625:92–100.38057664 10.1038/s 41586-023-06045-0PMC 11629659 · doi ↗ · pubmed ↗
8Chen Y , Liu L, Li J, et al Meta LR: meta-tuning of learning rates for transfer learning in medical imaging. In: Greenspan H et al (ed.), Medical Image Computing and Computer Assisted Intervention—MICCAI 2023. Cham: Springer Nature Switzerland, 2023, 706–16.