BGT: efficient and flexible genotype query across many samples

Heng Li

arXiv:1506.08452 · q-bio.GN · August 7, 2017 · Bioinform.

TL;DR

BGT is a compact format and a fast command-line tool for querying whole-genome genotype data across tens to hundreds of thousands of samples, with responses fast enough for real-time use.

Contribution

The paper introduces BGT: a compact file format, a fast command-line tool, and a simple web application for querying whole-genome genotypes and frequencies across many samples.

Findings

01. Encodes 32,488 samples across 39.2 million SNPs in 7.4GB

02. Decodes hundreds of millions of genotypes per CPU second

03. Enables real-time complex genotype queries

Abstract

Summary: BGT is a compact format, a fast command line tool and a simple web application for efficient and convenient query of whole-genome genotypes and frequencies across tens to hundreds of thousands of samples. On real data, it encodes the haplotypes of 32,488 samples across 39.2 million SNPs into a 7.4GB database and decodes a couple of hundred million genotypes per CPU second. The high performance enables real-time responses to complex queries.

Availability and implementation: https://github.com/lh3/bgt

Contact: [email protected]
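
To put the figures quoted in the abstract in perspective, here is a back-of-the-envelope sketch of the implied storage cost per genotype and the CPU time for a full scan. It rests on two assumptions that are not stated in the source: 1 GB is taken as 10^9 bytes, and "a couple of hundred million" is taken as 2 × 10^8 genotypes per CPU second. The variable names are illustrative only.

```python
# Rough arithmetic on the numbers quoted in the abstract.
# Assumptions (NOT from the paper): decimal gigabytes (1 GB = 10**9 bytes)
# and a decode rate of exactly 2 * 10**8 genotypes per CPU second.

N_SAMPLES = 32_488
N_SNPS = 39_200_000
DB_BYTES = 7_400_000_000      # 7.4 GB, assumed decimal
DECODE_RATE = 200_000_000     # assumed genotypes per CPU second

# One genotype per (sample, SNP) pair.
total_genotypes = N_SAMPLES * N_SNPS

# Average storage cost of one genotype in the compressed database.
bits_per_genotype = DB_BYTES * 8 / total_genotypes

# CPU time to decode every genotype once at the assumed rate.
full_scan_cpu_seconds = total_genotypes / DECODE_RATE

print(f"total genotypes:       {total_genotypes:,}")
print(f"bits per genotype:     {bits_per_genotype:.4f}")
print(f"full-scan CPU seconds: {full_scan_cpu_seconds:,.0f}")
```

Under these assumptions the database holds about 1.27 trillion genotypes at well under a tenth of a bit each, and decoding all of them once would take on the order of a couple of CPU hours. Queries restricted to a genomic region or a subset of samples touch far fewer genotypes, which is consistent with the real-time responses the abstract describes.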
