GenoBERT: A Language Model for Accurate Genotype Imputation
Lei Huang, Chuan Qiu, Kuan-Jui Su, Anqi Liu, Yun Gong, Weiqiang Lin, Lindong Jiang, Chen Zhao, Meng Song, Jeffrey Deng, Qing Tian, Zhe Luo, Ping Gong, Hui Shen, Chaoyang Zhang, and Hong-Wen Deng

TL;DR
GenoBERT is a transformer-based, reference-free genotype imputation model that achieves high accuracy across diverse datasets and ancestry groups, outperforming traditional methods especially at high missingness levels.
Contribution
The paper introduces GenoBERT, a transformer-based framework that captures linkage disequilibrium (LD) without relying on reference panels, improving imputation accuracy and robustness.
Findings
GenoBERT achieves higher overall accuracy than four baseline methods (Beagle5.4, SCDA, BiU-Net, and STICI) across datasets and missingness levels (5-50%).
Imputation accuracy is high ($r^2 \approx 0.98$) at 25% missing data.
Performance remains robust ($r^2 > 0.90$) even at 50% missingness.
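The $r^2$ values above can be made concrete with a small sketch. This summary does not state how the paper defines $r^2$, so the code below assumes the common convention of the squared Pearson correlation between true and imputed allele dosages at the masked sites. The function name `imputation_r2` and the toy dosage vectors are illustrative, not taken from the paper.

```python
def imputation_r2(true_dosages, imputed_dosages):
    """Squared Pearson correlation between true and imputed dosages.

    Assumption: this is one common definition of imputation r^2; the
    paper's exact metric is not given in this summary.
    """
    n = len(true_dosages)
    if n != len(imputed_dosages) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 entries")
    mean_t = sum(true_dosages) / n
    mean_i = sum(imputed_dosages) / n
    cov = sum((t - mean_t) * (i - mean_i)
              for t, i in zip(true_dosages, imputed_dosages))
    var_t = sum((t - mean_t) ** 2 for t in true_dosages)
    var_i = sum((i - mean_i) ** 2 for i in imputed_dosages)
    if var_t == 0 or var_i == 0:
        # Correlation is undefined when either vector is constant.
        return float("nan")
    return cov * cov / (var_t * var_i)


# Toy example: dosages (0, 1, or 2 alternate alleles) at eight masked sites.
truth = [0, 1, 2, 1, 0, 2, 1, 0]
imputed = [0, 1, 2, 1, 0, 1, 1, 0]  # one site imputed incorrectly
r2 = imputation_r2(truth, imputed)
print(f"r^2 = {r2:.3f}")
```

In practice such a metric is often computed per variant and then averaged, or stratified by allele frequency; which aggregation the authors use is not specified here.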
Abstract
Genotype imputation enables dense variant coverage for genome-wide association and risk-prediction studies, yet conventional reference-panel methods remain limited by ancestry bias and reduced rare-variant accuracy. We present Genotype Bidirectional Encoder Representations from Transformers (GenoBERT), a transformer-based, reference-free framework that tokenizes phased genotypes and uses a self-attention mechanism to capture both short- and long-range linkage disequilibrium (LD) dependencies. Benchmarking on two independent datasets, the Louisiana Osteoporosis Study (LOS) and the 1000 Genomes Project (1KGP), across ancestry groups and multiple genotype missingness levels (5-50%) shows that GenoBERT achieves the highest overall accuracy compared to four baseline methods (Beagle5.4, SCDA, BiU-Net, and STICI). At practical sparsity levels (up to 25% missing), GenoBERT attains high…
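To illustrate what tokenizing phased genotypes and evaluating at a given missingness level might look like, here is a minimal sketch. The token vocabulary (`REF`, `ALT`, `[MASK]`), the helper names, and the uniform random masking scheme are all assumptions made for illustration; the abstract does not describe GenoBERT's actual tokenizer or masking procedure.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, in the style of BERT-like models


def tokenize_haplotype(alleles):
    """Map the 0/1 alleles of one phased haplotype to string tokens.

    Assumption: a two-symbol vocabulary chosen purely for illustration.
    """
    return ["REF" if a == 0 else "ALT" for a in alleles]


def mask_tokens(tokens, missing_rate, rng):
    """Hide a fraction of positions to simulate missing genotypes.

    Returns the masked sequence and the sorted list of hidden positions,
    which a model would be asked to reconstruct.
    """
    n_mask = round(len(tokens) * missing_rate)
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for pos in positions:
        masked[pos] = MASK
    return masked, positions


rng = random.Random(0)  # fixed seed so the sketch is reproducible
haplotype = [0, 1, 1, 0, 0, 1, 0, 0]
tokens = tokenize_haplotype(haplotype)
masked, positions = mask_tokens(tokens, 0.25, rng)  # 25% missingness
print(masked)
print(positions)
```

A model trained on such inputs predicts the original token at each masked position from the surrounding context, which is how a self-attention model could use LD structure without a reference panel.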
