GenoBERT: A Language Model for Accurate Genotype Imputation

Lei Huang; Chuan Qiu; Kuan-Jui Su; Anqi Liu; Yun Gong; Weiqiang Lin; Lindong Jiang; Chen Zhao; Meng Song; Jeffrey Deng; Qing Tian; Zhe Luo; Ping Gong; Hui Shen; Chaoyang Zhang; and Hong-Wen Deng

arXiv:2604.00058·q-bio.GN·April 2, 2026

TL;DR

GenoBERT is a transformer-based, reference-free genotype imputation model that achieves high accuracy across diverse datasets and ancestry groups, outperforming traditional methods especially at high missingness levels.

Contribution

The paper introduces GenoBERT, a transformer-based framework that captures linkage disequilibrium (LD) without relying on reference panels, improving imputation accuracy and robustness.

Findings

01

GenoBERT outperforms baseline methods in accuracy across datasets and missingness levels.

02

Achieves high imputation accuracy ($r^2 \approx 0.98$) at 25% missing data.

03

Maintains robust performance ($r^2 > 0.90$) even at 50% missingness.
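
The findings above are stated in terms of $r^2$ at a given missingness level. As a rough illustration of how such a number is typically obtained, the sketch below hides a fraction of known genotype calls and scores predictions at the hidden positions with the squared Pearson correlation. This is an assumption about the evaluation protocol, not the paper's exact procedure; the function names `mask_genotypes` and `squared_pearson` and the 0/1/2 dosage encoding are hypothetical choices for this example.

```python
import random
from statistics import mean


def mask_genotypes(genotypes, missing_rate, seed=0):
    """Hide a fraction of genotype calls (hypothetical helper).

    Returns the masked list (hidden calls become None) and the sorted
    indices of the hidden positions.
    """
    rng = random.Random(seed)
    n_masked = round(len(genotypes) * missing_rate)
    masked_idx = sorted(rng.sample(range(len(genotypes)), n_masked))
    masked = list(genotypes)
    for i in masked_idx:
        masked[i] = None
    return masked, masked_idx


def squared_pearson(true_vals, pred_vals):
    """Squared Pearson correlation between true and predicted dosages."""
    mx, my = mean(true_vals), mean(pred_vals)
    sxy = sum((x - mx) * (y - my) for x, y in zip(true_vals, pred_vals))
    sxx = sum((x - mx) ** 2 for x in true_vals)
    syy = sum((y - my) ** 2 for y in pred_vals)
    if sxx == 0 or syy == 0:
        # Correlation is undefined when either vector is constant.
        return float("nan")
    return (sxy * sxy) / (sxx * syy)


# Toy data: alternate-allele dosages (0, 1, or 2) at eight variant sites.
truth = [0, 1, 2, 1, 0, 2, 1, 0]
masked, hidden = mask_genotypes(truth, missing_rate=0.25)

# An imputation model would fill in the hidden positions here; this example
# reuses the true values as a stand-in, so the score is trivially perfect.
imputed = [truth[i] for i in hidden]
```

In practice the score would be computed over many samples and sites, since a handful of hidden positions can easily be constant and leave the correlation undefined.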

Abstract

Genotype imputation enables dense variant coverage for genome-wide association and risk-prediction studies, yet conventional reference-panel methods remain limited by ancestry bias and reduced rare-variant accuracy. We present Genotype Bidirectional Encoder Representations from Transformers (GenoBERT), a transformer-based, reference-free framework that tokenizes phased genotypes and uses a self-attention mechanism to capture both short- and long-range linkage disequilibrium (LD) dependencies. Benchmarking on two independent datasets including the Louisiana Osteoporosis Study (LOS) and the 1000 Genomes Project (1KGP) across ancestry groups and multiple genotype missingness levels (5-50%) shows that GenoBERT achieves the highest overall accuracy compared to four baseline methods (Beagle5.4, SCDA, BiU-Net, and STICI). At practical sparsity levels (up to 25% missing), GenoBERT attains high…
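
The abstract says GenoBERT tokenizes phased genotypes but, as excerpted here, does not give the vocabulary. A minimal sketch of one plausible scheme follows, assuming biallelic sites written in VCF-style phased notation (`0|1`, `1|0`, and so on) and a BERT-style `[MASK]` token for missing calls. The `VOCAB` mapping and the `tokenize` function are hypothetical and are not taken from the paper.

```python
# Hypothetical vocabulary: one token per phased biallelic genotype,
# plus a mask token standing in for missing calls.
VOCAB = {"0|0": 0, "0|1": 1, "1|0": 2, "1|1": 3, "[MASK]": 4}


def tokenize(phased_calls):
    """Map phased genotype strings to token ids.

    Any call not in the vocabulary (for example the VCF missing
    value ".|.") is mapped to the mask token.
    """
    mask_id = VOCAB["[MASK]"]
    return [VOCAB.get(call, mask_id) for call in phased_calls]


# One sample across four variant sites, with the third call missing.
tokens = tokenize(["0|0", "0|1", ".|.", "1|1"])
```

Keeping `0|1` and `1|0` as distinct tokens preserves phase, which is the information a self-attention model would need to learn haplotype-level LD patterns.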
