An Allele-Centric Pan-Graph-Matrix Representation for Scalable Pangenome Analysis
Roberto Garrone

TL;DR
This paper introduces an allele-centric pan-graph-matrix representation for scalable pangenome analysis, enabling efficient encoding of genetic variation and haplotype information across large cohorts.
Contribution
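The author presents two representations: H1, an allele-centric pan-graph-matrix that improves storage efficiency through adaptive per-allele compression, and H2, a path-centric dual that restores explicit haplotype ordering while carrying exactly the same information.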
Findings
Achieves near-optimal storage across both common and rare variants
Provides substantial compression gains for structural variants
Maintains exact information content while improving scalability
Abstract
Population-scale pangenome analysis increasingly requires representations that unify single-nucleotide and structural variation while remaining scalable across large cohorts. Existing formats are typically sequence-centric, path-centric, or sample-centric, and often obscure population structure or fail to exploit carrier sparsity. We introduce the H1 pan-graph-matrix, an allele-centric representation that encodes exact haplotype membership using adaptive per-allele compression. By treating alleles as first-class objects and selecting optimal encodings based on carrier distribution, H1 achieves near-optimal storage across both common and rare variants. We further introduce H2, a path-centric dual representation derived from the same underlying allele-haplotype incidence information that restores explicit haplotype ordering while remaining exactly equivalent in information content. Using…
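The abstract's two ideas, adaptive per-allele encoding and a path-centric dual derived from the same incidence data, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the three encodings (`sparse`, `bitmap`, `complement`), the bit-cost model, and the function names `encode_allele`, `decode_allele`, and `haplotype_paths` are hypothetical and are not taken from the paper, whose actual encodings are not given in this summary.

```python
from typing import Dict, List, Set, Tuple, Union

# Hypothetical payload: a sorted index list, or an integer used as a bitmap.
Payload = Union[List[int], int]
Encoded = Tuple[str, Payload]


def encode_allele(carriers: Set[int], n_haps: int) -> Encoded:
    """Store one allele's carrier set using the cheapest of three encodings.

    Assumed cost model (illustrative only): an index list costs
    ceil(log2(n_haps)) bits per entry, a bitmap costs one bit per haplotype.
    """
    k = len(carriers)
    index_bits = max(1, (n_haps - 1).bit_length())
    costs = {
        "sparse": k * index_bits,                 # list the carriers (rare allele)
        "bitmap": n_haps,                         # one bit per haplotype (common allele)
        "complement": (n_haps - k) * index_bits,  # list the non-carriers (near-fixed allele)
    }
    mode = min(costs, key=costs.get)  # ties go to the earlier key
    if mode == "sparse":
        return mode, sorted(carriers)
    if mode == "complement":
        return mode, sorted(set(range(n_haps)) - carriers)
    bits = 0
    for h in carriers:
        bits |= 1 << h
    return mode, bits


def decode_allele(encoded: Encoded, n_haps: int) -> Set[int]:
    """Recover the exact carrier set, whichever encoding was chosen."""
    mode, payload = encoded
    if mode == "sparse":
        return set(payload)
    if mode == "complement":
        return set(range(n_haps)) - set(payload)
    return {h for h in range(n_haps) if (payload >> h) & 1}


def haplotype_paths(
    alleles: Dict[str, Encoded], order: List[str], n_haps: int
) -> List[List[str]]:
    """Path-centric dual: for each haplotype, the alleles it carries,
    listed in the given (assumed genomic) order."""
    paths: List[List[str]] = [[] for _ in range(n_haps)]
    for allele_id in order:
        for h in decode_allele(alleles[allele_id], n_haps):
            paths[h].append(allele_id)
    return paths


# Toy cohort of 8 haplotypes with a rare, a common, and a near-fixed allele.
N_HAPS = 8
toy = {
    "snv1:A": encode_allele({2}, N_HAPS),
    "sv1:DEL": encode_allele({0, 1, 2, 3}, N_HAPS),
    "snv2:T": encode_allele(set(range(7)), N_HAPS),
}
toy_paths = haplotype_paths(toy, ["snv1:A", "sv1:DEL", "snv2:T"], N_HAPS)
```

In this sketch the allele-centric view (`toy`) and the path-centric view (`toy_paths`) are two readings of the same allele-haplotype incidence relation, which is why converting between them loses nothing. The choice of encoding depends only on how many haplotypes carry the allele, which is one simple way to exploit carrier sparsity.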
Taxonomy
Topics: Genetic Associations and Epidemiology · Bioinformatics and Genomic Networks · Genomics and Rare Diseases
