When repeats drive the vocabulary: a Byte-Pair Encoding analysis of T2T primate genomes
Marina Popova, Iaroslav Chelombitko, Aleksey Komissarov

TL;DR
This study applies Byte Pair Encoding to primate genomes to analyze shared and unique sequences, revealing limitations of BPE in comparative genomics due to repetitive elements, and suggests hybrid strategies for improved tokenization.
Contribution
The study introduces dnaBPE, a custom BPE tokenizer for genomes, and assesses its effectiveness and limitations for comparative genomics of primates.
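For readers unfamiliar with BPE on nucleotide sequences, the following is a minimal sketch of the general merge loop. It is not dnaBPE's implementation, which is not described here; the function name `train_bpe` and the stopping rule are assumptions made for illustration.

```python
from collections import Counter


def train_bpe(sequence, num_merges):
    """Toy BPE trainer (hypothetical helper, not dnaBPE).

    Starts from single nucleotides and repeatedly merges the most
    frequent adjacent token pair into a new, longer token.
    """
    tokens = list(sequence)  # initial tokens: one per nucleotide
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(tokens, tokens[1:]))
        if not pairs:
            break
        (left, right), count = pairs.most_common(1)[0]
        if count < 2:
            break  # assumption: stop once no pair repeats
        merges.append(left + right)
        # Rewrite the token stream, replacing each occurrence of the pair.
        merged, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == left and tokens[i + 1] == right:
                merged.append(left + right)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return merges, tokens


# A short tandem repeat is absorbed into progressively longer tokens.
merges, tokens = train_bpe("ACACACGT", 5)
```

On this toy input the repeat `AC` is merged first and then merged with itself, which hints at how high-copy repeats can claim a large share of a fixed-size vocabulary.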
Findings
Only 11,569 tokens are shared across all nine genomes
Nearly 991,854 tokens are unique to a single genome (each vocabulary holds 512,000 tokens)
Phylogenetic trees based on token overlap do not match known relationships
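The shared and unique token counts above can be thought of as a presence tally across vocabularies. The sketch below shows one way such a tally might be computed; the helper `sharing_profile` and the toy vocabularies are hypothetical and are not taken from the paper.

```python
from collections import Counter


def sharing_profile(vocabularies):
    """Map k -> number of tokens present in exactly k vocabularies.

    `vocabularies` is a dict of genome name -> iterable of tokens
    (hypothetical input format, assumed for illustration).
    """
    presence = Counter()
    for vocab in vocabularies.values():
        presence.update(set(vocab))  # count each token once per genome
    return Counter(presence.values())


# Toy vocabularies, invented for this example.
vocabularies = {
    "genome_a": {"AC", "GT", "ACAC", "TTAGGG"},
    "genome_b": {"AC", "GT", "GGAAT"},
    "genome_c": {"AC", "CCG", "TTAGGG"},
}
profile = sharing_profile(vocabularies)
shared_by_all = profile[len(vocabularies)]  # tokens found in every genome
unique_to_one = profile[1]                  # tokens found in exactly one genome
```

With nine vocabularies of 512,000 tokens each, `profile[9]` and `profile[1]` would correspond to the two headline counts, assuming tokens are compared as exact strings.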
Abstract
The emergence of telomere-to-telomere (T2T) genome assemblies has opened new avenues for comparative genomics, yet effective tokenization strategies for genomic sequences remain underexplored. In this pilot study, we apply Byte Pair Encoding (BPE) to nine T2T primate genomes, including three human assemblies, by training independent BPE tokenizers with a fixed vocabulary of 512,000 tokens using our custom tool, dnaBPE. Our analysis reveals that only 11,569 tokens are shared across all assemblies, while nearly 991,854 tokens are unique to a single genome, indicating that the shared vocabulary declines rapidly as more assemblies are compared. Moreover, phylogenetic trees derived from token overlap failed to recapitulate established primate relationships, a discrepancy attributed to the disproportionate influence of species-specific high-copy repetitive elements. These findings underscore the…
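The abstract does not state which overlap measure or tree-building method was used. As one plausible illustration, the sketch below computes pairwise Jaccard distances between vocabularies; the choice of Jaccard distance and the helper names are assumptions, and the resulting matrix would still need to be passed to a tree-building method of the reader's choice.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |a & b| / |a | b|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(vocabularies):
    """Pairwise token-overlap distances (hypothetical helper).

    Returns the sorted genome names and a dict keyed by (name, name).
    """
    names = sorted(vocabularies)
    dist = {(n, n): 0.0 for n in names}
    for x, y in combinations(names, 2):
        d = jaccard_distance(set(vocabularies[x]), set(vocabularies[y]))
        dist[(x, y)] = dist[(y, x)] = d  # distances are symmetric
    return names, dist


# Toy vocabularies, invented for this example.
toy = {
    "a": {"AC", "GT"},
    "b": {"AC", "GT"},
    "c": {"AC", "TT"},
}
names, dist = distance_matrix(toy)
```

Because every token counts equally in such a measure, a large block of species-specific repeat-derived tokens can dominate the distance, which is consistent with the discrepancy the abstract reports.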
Taxonomy
Topics: Genomics and Phylogenetic Studies · Language and cultural evolution · RNA and protein synthesis mechanisms
Methods: Byte Pair Encoding
