Bridging Chaos Game Representations and $k$-mer Frequencies of DNA Sequences
Haoze He, Lila Kari, Pablo Millan Arias

TL;DR
This paper develops a mathematical framework linking Chaos Game Representations of DNA to their $k$-mer frequencies, enabling sequence reconstruction and data augmentation for genomic analysis.
Contribution
It introduces a formal connection between CGR and $k$-mer frequencies, and presents an algorithm for reconstructing sequences from $k$-mer distributions.
Findings
Proves equivalence between FCGR and discretized CGR at resolution $2^k$
Develops an algorithm for sequence reconstruction from $k$-mer profiles
Validates the method on real and artificial genomic data
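The first finding can be checked numerically on a short sequence. The sketch below is illustrative only and is not taken from the paper: it assumes a common corner convention (A bottom-left, C top-left, G top-right, T bottom-right), a starting point at the centre of the unit square, and exact rational arithmetic to avoid floating-point drift. All function names are hypothetical.

```python
from collections import Counter
from fractions import Fraction

# Assumed corner convention (not specified on this page).
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
CORNER_TO_BASE = {corner: base for base, corner in CORNERS.items()}


def cgr_points(seq):
    """Return the CGR point reached after each nucleotide, as exact fractions."""
    x = y = Fraction(1, 2)  # assumed start: centre of the unit square
    points = []
    for base in seq:
        cx, cy = CORNERS[base]
        x, y = (x + cx) / 2, (y + cy) / 2  # move halfway to the corner
        points.append((x, y))
    return points


def kmer_counts(seq, k):
    """Count k-mers directly with a sliding window."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def cell_to_kmer(ix, iy, k):
    """Decode a grid cell into its k-mer: bit 0 is the oldest nucleotide."""
    return "".join(
        CORNER_TO_BASE[((ix >> bit) & 1, (iy >> bit) & 1)] for bit in range(k)
    )


def discretized_cgr_counts(seq, k):
    """Bin CGR points into a 2^k x 2^k grid and label each cell by its k-mer."""
    n = 2 ** k
    counts = Counter()
    # Skip the first k - 1 points: they have not yet seen k nucleotides.
    for x, y in cgr_points(seq)[k - 1:]:
        counts[cell_to_kmer(int(x * n), int(y * n), k)] += 1
    return counts


if __name__ == "__main__":
    seq = "ACGTTGCAACGGT"
    for k in (2, 3):
        print(k, discretized_cgr_counts(seq, k) == kmer_counts(seq, k))
```

Under these assumptions, each grid cell collects exactly the points whose last $k$ nucleotides spell one particular $k$-mer, so the two counts agree.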
Abstract
This paper establishes formal mathematical foundations linking Chaos Game Representations (CGR) of DNA sequences to their underlying $k$-mer frequencies. We prove that the Frequency CGR (FCGR) of order $k$ is mathematically equivalent to a discretization of CGR at resolution $2^k$, and its vectorization corresponds to the $k$-mer frequencies of the sequence. Additionally, we characterize how symmetry transformations of CGR images correspond to specific nucleotide permutations in the originating sequences. Leveraging these insights, we introduce an algorithm that generates synthetic DNA sequences from prescribed $k$-mer distributions by constructing Eulerian paths on De Bruijn multigraphs. This enables reconstruction of sequences matching target $k$-mer profiles with arbitrarily high precision, facilitating the creation of synthetic CGR images for applications such as data…
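The idea of recovering a sequence from $k$-mer counts via an Eulerian path can be illustrated with a minimal sketch. This is not the authors' algorithm: it is plain Hierholzer traversal on a De Bruijn multigraph whose nodes are $(k-1)$-mers and whose edges are $k$-mers with multiplicity, and it assumes the counts already admit a single Eulerian path (for example, because they were computed from a real sequence). The function name is hypothetical.

```python
from collections import Counter, defaultdict


def sequence_from_kmer_counts(counts):
    """Build one sequence whose k-mer counts equal `counts` (requires k >= 2)."""
    k = len(next(iter(counts)))
    if k < 2:
        raise ValueError("this sketch needs k >= 2")

    out_edges = defaultdict(list)  # (k-1)-mer -> list of successor (k-1)-mers
    balance = Counter()            # out-degree minus in-degree per node
    for kmer, mult in counts.items():
        u, v = kmer[:-1], kmer[1:]
        out_edges[u].extend([v] * mult)
        balance[u] += mult
        balance[v] -= mult

    # Start at the node with one surplus outgoing edge, if there is one;
    # otherwise the graph is balanced and any node with edges will do.
    start = next((n for n, b in balance.items() if b == 1), next(iter(out_edges)))

    # Hierholzer's algorithm with an explicit stack.
    stack, path = [start], []
    while stack:
        node = stack[-1]
        if out_edges[node]:
            stack.append(out_edges[node].pop())
        else:
            path.append(stack.pop())
    path.reverse()

    if len(path) != sum(counts.values()) + 1:
        raise ValueError("k-mer counts do not admit a single Eulerian path")

    # Spell the sequence: first node, then the last letter of each later node.
    return path[0] + "".join(node[-1] for node in path[1:])


if __name__ == "__main__":
    original = "ACGTTGCAACGT"
    k = 3
    target = Counter(original[i:i + k] for i in range(len(original) - k + 1))
    rebuilt = sequence_from_kmer_counts(target)
    check = Counter(rebuilt[i:i + k] for i in range(len(rebuilt) - k + 1))
    print(rebuilt, check == target)
```

The rebuilt sequence need not equal the original, since several Eulerian paths may exist; only its $k$-mer counts are guaranteed to match. Handling arbitrary target distributions that are not exactly realizable is outside the scope of this sketch.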
Taxonomy
Topics: Fractal and DNA sequence analysis · Genome Rearrangement Algorithms · Genomics and Phylogenetic Studies
