# HapScoreDB: a database of protein language model functional scores for haplotype-resolved protein sequences

**Authors:** Fabio Mazza, Filippo Gastaldello, Davide Dalfovo, Gianluca Lattanzi, Alessandro Romanel

PMC · DOI: 10.1093/nar/gkaf1184 · Nucleic Acids Research · 2025-11-20

## TL;DR

HapScoreDB is a database of protein language model-derived fitness scores for haplotype-resolved human protein sequences, built from phased 1000 Genomes Project variants. It helps interpret how variants inherited together on the same haplotype affect protein function.

## Contribution

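HapScoreDB provides protein language model-derived scores for haplotype-resolved protein-coding sequences across all human transcript isoforms, supporting the interpretation of variants that are inherited together and may interact epistatically.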

## Key findings

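- In preliminary analyses, haplotypes harboring cancer GWAS variants tend to have significantly reduced predicted fitness.
- Variability in scores across haplotypes of the same transcript highlights known cancer genes.
- The database includes over 130,000 distinct protein haplotypes from more than 18,000 genes and 78,000 transcripts, encompassing over 94,000 coding variants, with a fitness score for each haplotype.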
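
The second finding rests on comparing score dispersion across haplotypes of the same transcript. The summary does not say which dispersion statistic the authors use, so the following is only a minimal sketch that assumes the population standard deviation. The record layout, the transcript and haplotype identifiers, and the score values are all hypothetical and are not taken from HapScoreDB.

```python
from collections import defaultdict
from statistics import pstdev

# Hypothetical records: (transcript_id, haplotype_id, score).
# All identifiers and values are invented for illustration only.
records = [
    ("TX_A", "hap1", -1.0),
    ("TX_A", "hap2", -3.0),
    ("TX_B", "hap1", -2.0),
    ("TX_B", "hap2", -2.0),
    ("TX_B", "hap3", -2.0),
]


def score_dispersion(rows):
    """Return the population standard deviation of scores per transcript.

    Assumption: standard deviation stands in for whatever dispersion
    measure the paper actually uses. Transcripts with fewer than two
    haplotypes are skipped because dispersion is not informative there.
    """
    scores_by_transcript = defaultdict(list)
    for transcript_id, _haplotype_id, score in rows:
        scores_by_transcript[transcript_id].append(score)
    return {
        transcript_id: pstdev(scores)
        for transcript_id, scores in scores_by_transcript.items()
        if len(scores) >= 2
    }


def rank_by_dispersion(rows):
    """Sort transcripts from most to least variable haplotype scores."""
    dispersion = score_dispersion(rows)
    return sorted(dispersion.items(), key=lambda item: item[1], reverse=True)


ranking = rank_by_dispersion(records)
```

With these toy records, `TX_A` ranks first because its two haplotype scores differ, while the three identical scores of `TX_B` give zero dispersion.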

## Abstract

Deciphering the functional effects of genetic variants, especially those inherited together on the same haplotype, remains a major challenge in human genetics, where epistasis among co-occurring variants can further complicate interpretation. To address this, we present HapScoreDB, a database offering protein language model-derived scores for haplotype-resolved protein-coding sequences across all human transcript isoforms. Leveraging GENCODE and Ensembl annotations with phased variant data from the 1000 Genomes Project, HapScoreDB includes over 130 000 distinct protein haplotypes from >18 000 genes and 78 000 transcripts, encompassing over 94 000 coding variants. Fitness scores for each haplotype were computed using state-of-the-art protein language models. Preliminary analyses show that haplotypes harboring cancer GWAS variants tend to have significantly reduced predicted fitness. Moreover, variability in scores across haplotypes of the same transcript highlights known cancer genes, suggesting that dispersion in predicted fitness may capture functionally important variation. HapScoreDB features a user-friendly web interface for interactive exploration, visualization, and download of both full and customized datasets. As a dynamic and expandable platform, it connects real-world human genetic variation with advanced protein modeling, enabling novel approaches in variant interpretation, isoform prioritization, and population-scale functional genomics. Access HapScoreDB at https://bcglab.cibio.unitn.it/hapscoredb.
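
To make "haplotype-resolved protein sequence" concrete, the sketch below applies a set of co-occurring amino acid substitutions to a reference protein so that the whole haplotype can be scored as one sequence. It is a simplified illustration, not the authors' pipeline: it assumes the variants are already phased and translated to 1-based protein-level substitutions, it ignores indels, frameshifts, and stop codons, and the function name, sequence, and variants are hypothetical.

```python
def apply_substitutions(reference, substitutions):
    """Build a haplotype protein sequence from a reference sequence.

    reference: amino acid string for the reference protein.
    substitutions: iterable of (position, ref_aa, alt_aa) tuples, with
        1-based positions, all assumed to lie on the same haplotype.

    Raises ValueError if a position is out of range or the expected
    reference residue does not match, which usually signals a mismatch
    between the annotation and the sequence.
    """
    residues = list(reference)
    for position, ref_aa, alt_aa in substitutions:
        if not 1 <= position <= len(residues):
            raise ValueError(f"position {position} is outside the sequence")
        if residues[position - 1] != ref_aa:
            raise ValueError(
                f"expected {ref_aa} at position {position}, "
                f"found {residues[position - 1]}"
            )
        residues[position - 1] = alt_aa
    return "".join(residues)


# Hypothetical reference peptide and two variants on one haplotype.
reference_protein = "MKTAYIAK"
haplotype_variants = [(2, "K", "R"), (5, "Y", "H")]
haplotype_protein = apply_substitutions(reference_protein, haplotype_variants)
print(haplotype_protein)  # MRTAHIAK
```

Scoring the combined sequence, rather than each substitution separately, is what lets a model reflect interactions among variants that occur together.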

## Linked entities

- **Diseases:** cancer (MONDO:0004992)
- **Species:** Homo sapiens (taxon 9606)

## Full-text entities

- **Diseases:** cancer (MESH:D009369)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12807696/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12807696/full.md

## References

53 references — full list in the complete paper: https://tomesphere.com/paper/PMC12807696/full.md

---
Source: https://tomesphere.com/paper/PMC12807696