BGT: efficient and flexible genotype query across many samples
Heng Li

TL;DR
BGT is a compact file format and a fast command-line tool for querying whole-genome genotypes and frequencies across tens to hundreds of thousands of samples, with responses fast enough for real-time use.
Contribution
The paper introduces BGT: a compact genotype format, a command-line tool, and a simple web application that together make queries over large multi-sample genotype datasets fast and flexible.
Findings
Encodes the haplotypes of 32,488 samples across 39.2 million SNPs in a 7.4GB database
Decodes a couple of hundred million genotypes per CPU second
Enables real-time complex genotype queries
Abstract
Summary: BGT is a compact format, a fast command-line tool and a simple web application for efficient and convenient query of whole-genome genotypes and frequencies across tens to hundreds of thousands of samples. On real data, it encodes the haplotypes of 32,488 samples across 39.2 million SNPs into a 7.4GB database and decodes a couple of hundred million genotypes per CPU second. The high performance enables real-time responses to complex queries.
Availability and implementation: https://github.com/lh3/bgt
Contact: [email protected]
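To put the quoted figures in perspective, the storage density they imply can be worked out directly. The sketch below is only back-of-the-envelope arithmetic on the numbers in the abstract, not part of BGT. It assumes 1 GB = 10^9 bytes and two haplotypes per sample (diploid); the function name is illustrative.

```python
# Back-of-the-envelope storage density from the figures quoted in the abstract.
SAMPLES = 32_488
SNPS = 39_200_000      # 39.2 million sites
DB_BYTES = 7.4e9       # assumption: 1 GB = 10**9 bytes


def bits_per_genotype(samples: int, sites: int, db_bytes: float) -> float:
    """Average number of database bits spent on one sample-by-site genotype."""
    return db_bytes * 8 / (samples * sites)


genotypes = SAMPLES * SNPS
bpg = bits_per_genotype(SAMPLES, SNPS, DB_BYTES)

print(f"{genotypes:.3e} genotypes in total")
print(f"{bpg:.4f} bits per genotype")
# Assumption: diploid samples, so each genotype holds two haplotype alleles.
print(f"{bpg / 2:.4f} bits per haplotype allele")
```

Under these assumptions the database spends well under one bit per genotype, which is the sense in which the format is compact.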
