VQDNA: Unleashing the Power of Vector Quantization for Multi-Species Genomic Sequence Modeling
Siyuan Li, Zedong Wang, Zicheng Liu, Di Wu, Cheng Tan, Jiangbin Zheng, Yufei Huang, Stan Z. Li

TL;DR
VQDNA introduces a novel vector quantization-based framework for genome tokenization, enabling adaptive, pattern-aware embeddings that improve genome modeling and reveal biologically significant mutation patterns.
Contribution
Proposes VQDNA, a vector quantization-based genome tokenizer, together with Hierarchical Residual Quantization (HRQ) for richer genome vocabulary learning, and demonstrates its effectiveness across multiple datasets.
Findings
Outperforms existing genome language models in accuracy and efficiency
Reveals biologically meaningful mutation patterns in SARS-CoV-2
Enriches genome vocabulary with hierarchical codebooks
Abstract
Similar to natural language models, pre-trained genome language models are proposed to capture the underlying intricacies within genomes with unsupervised sequence modeling. They have become essential tools for researchers and practitioners in biology. However, the hand-crafted tokenization policies used in these models may not encode the most discriminative patterns from the limited vocabulary of genomic data. In this paper, we introduce VQDNA, a general-purpose framework that renovates genome tokenization from the perspective of genome vocabulary learning. By leveraging vector-quantized codebooks as learnable vocabulary, VQDNA can adaptively tokenize genomes into pattern-aware embeddings in an end-to-end manner. To further push its limits, we propose Hierarchical Residual Quantization (HRQ), where varying scales of codebooks are designed in a hierarchy to enrich the genome vocabulary…
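To make the codebook idea in the abstract concrete, here is a minimal NumPy sketch of vector-quantized tokenization followed by a generic multi-level residual quantization. It is an illustration under stated assumptions, not the authors' implementation: the function names `quantize` and `residual_quantize`, the random codebooks, the embedding dimension, and the codebook sizes (4, 16, 64) are all hypothetical, and the abstract does not specify how HRQ arranges or trains its hierarchy. In a real model the encoder and codebooks would be learned end-to-end; training is omitted here.

```python
import numpy as np

def quantize(z, codebook):
    """Map each row of z to the index and vector of its nearest code."""
    # Squared Euclidean distances, shape (n_positions, n_codes).
    dists = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    idx = dists.argmin(axis=1)
    return idx, codebook[idx]

def residual_quantize(z, codebooks):
    """Quantize z level by level; each codebook codes the previous residual."""
    residual = z.copy()
    recon = np.zeros_like(z)
    indices = []
    for codebook in codebooks:
        idx, q = quantize(residual, codebook)
        indices.append(idx)
        recon += q                 # accumulate the quantized approximation
        residual = residual - q    # what is left for the next level to encode
    # One token per level for every position: shape (n_positions, n_levels).
    return np.stack(indices, axis=1), recon

rng = np.random.default_rng(0)
dim = 8
# Hypothetical encoder outputs for 16 positions of a genomic sequence.
z = rng.normal(size=(16, dim))
# Hypothetical hierarchy of codebooks with varying sizes (untrained, random).
codebooks = [rng.normal(size=(k, dim)) for k in (4, 16, 64)]

tokens, recon = residual_quantize(z, codebooks)
print(tokens.shape, recon.shape)
```

Each position ends up with one discrete token per level, and the pattern-aware embedding is the sum of the selected code vectors. Because the codebooks here are random rather than learned, the reconstruction is not expected to be accurate; the sketch only shows the lookup mechanics.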
Taxonomy
Topics: Genomics and Phylogenetic Studies · Molecular Biology Techniques and Applications · Chromosomal and Genetic Variations
