Fractal Language Modelling by Universal Sequence Maps (USM)
Jonas S Almeida, Daniel E Russ, Susana Vinga, Ines Duarte, Lee Mason, Praphulla Bhawsar, Aaron Ge, Arlindo Oliveira, Jeya Balaji Balasubramanian

TL;DR
This paper introduces Universal Sequence Maps (USM), a fractal encoding method for symbolic sequences that preserves context at multiple scales. The encoding supports efficient numerical analysis, converges to a steady state, and is demonstrated on genomic data.
Contribution
The paper advances bijective fractal encoding with USM by resolving seeding biases and showing that the encoding converges to a steady state. The approach applies to sequences of arbitrary alphabet size.
Findings
USM provides a bijective, fractal encoding of sequences.
USM converges to a steady state embedding.
The application is demonstrated on genomic sequences, with potential for arbitrary alphabets.
Abstract
Motivation: With the advent of Language Models using Transformers, popularized by ChatGPT, there is a renewed interest in exploring encoding procedures that numerically represent symbolic sequences at multiple scales and embedding dimensions. The challenge that encoding addresses is the need for mechanisms that uniquely retain contextual information about the succession of individual symbols, which can then be modeled by nonlinear formulations such as neural networks.
Context: Universal Sequence Maps (USM) are iterated functions that bijectively encode symbolic sequences onto embedded numerical spaces. USM is composed of two Chaos Game Representations (CGR), iterated forwardly and backwardly, that can be projected into the frequency domain (FCGR). The corresponding USM coordinates can be used to compute a Chebyshev distance metric as well as k-mer frequencies, without having to…
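The forward and backward CGR iteration described in the abstract can be illustrated with a minimal sketch. It assumes the classical CGR setup for DNA: four symbols at the corners of the unit square, each step moving halfway toward the corner of the current symbol, and a midpoint seed of (1/2, 1/2). The corner assignment, the seed, and the function names `cgr`, `usm`, and `decode` are illustrative assumptions. In particular, the midpoint seed is the classical choice and not the seeding procedure the paper proposes. Exact fractions are used so that decoding is not limited by floating-point precision.

```python
from fractions import Fraction

# Assumed corner assignment (classical CGR layout for DNA), not taken from the paper.
CORNERS = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}
SYMBOLS = {corner: symbol for symbol, corner in CORNERS.items()}
HALF = Fraction(1, 2)


def cgr(seq, seed=(HALF, HALF)):
    """Iterate the chaos game: move halfway toward each symbol's corner."""
    point = seed
    coords = []
    for symbol in seq:
        corner = CORNERS[symbol]
        point = tuple((p + c) / 2 for p, c in zip(point, corner))
        coords.append(point)
    return coords


def usm(seq):
    """Return forward and backward CGR coordinates, one pair per position."""
    forward = cgr(seq)
    # Iterate over the reversed sequence, then realign to the original positions.
    backward = cgr(seq[::-1])[::-1]
    return forward, backward


def decode(point, n):
    """Recover the last n symbols that led to `point` by inverting the iteration."""
    symbols = []
    for _ in range(n):
        # The most recent symbol is the corner of the quadrant holding the point.
        corner = tuple(1 if p > HALF else 0 for p in point)
        symbols.append(SYMBOLS[corner])
        # Undo the halving step to obtain the previous point.
        point = tuple(2 * p - c for p, c in zip(point, corner))
    return "".join(reversed(symbols))


seq = "GATTACA"
forward, backward = usm(seq)
recovered_prefix = decode(forward[-1], len(seq))   # symbols up to the last position
recovered_suffix = decode(backward[0], len(seq))   # symbols from the first position, reversed
```

Under these assumptions, the last forward coordinate encodes the whole sequence and the first backward coordinate encodes it in reverse order, which is one way to see why the mapping is invertible: each coordinate pair carries the preceding context in one direction and the following context in the other.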
Taxonomy
Topics: Fractal and DNA sequence analysis · Language and cultural evolution · Quasicrystal Structures and Properties
