Capacity and Expressiveness of Genomic Tandem Duplication
Siddharth Jain, Farzad Farnoud, Jehoshua Bruck

TL;DR
This paper investigates the capacity and expressiveness of genomic tandem duplication systems. It gives exact capacity values for certain systems and characterizes when a system can generate every string over its alphabet as a substring, with implications for understanding genetic sequence diversity.
Contribution
The paper gives a complete characterization of the expressiveness of tandem duplication systems across alphabet sizes and duplication lengths, and shows that expressiveness is limited for alphabets of size four or more.
Findings
Exact capacity values for certain tandem duplication systems.
Tandem duplication systems with alphabet size ≥4 are not fully expressive.
Duplication length impacts the generative power more than the seed.
Abstract
The majority of the human genome consists of repeated sequences. An important type of repeated sequence common in the human genome is the tandem repeat, where identical copies appear next to each other. For example, in the sequence AGTCTGTGC, TGTG is a tandem repeat that may be generated from AGTCTGC by a tandem duplication of length 2. In this work, we investigate the possibility of generating a large number of sequences from a seed, i.e. a small initial string, by tandem duplications of bounded length. We study the capacity of such a system, a notion that quantifies the system's generating power. Our results include exact capacity values for certain tandem duplication string systems. In addition, motivated by the role of DNA sequences in expressing proteins via RNA and the genetic code, we define the notion of the expressiveness of a…
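The tandem duplication operation described in the abstract can be illustrated with a small sketch. This is not the authors' code: the function names `tandem_duplicate` and `descendants`, and the length cap `max_len` used to keep the search finite, are assumptions introduced here for illustration only.

```python
from typing import Set


def tandem_duplicate(s: str, start: int, length: int) -> str:
    """Return s with the block s[start:start+length] copied right after itself."""
    if length < 1 or start < 0 or start + length > len(s):
        raise ValueError("duplication window falls outside the string")
    block = s[start:start + length]
    return s[:start + length] + block + s[start + length:]


def descendants(seed: str, max_dup_len: int, max_len: int) -> Set[str]:
    """Strings of length <= max_len reachable from seed by tandem
    duplications of length <= max_dup_len (hypothetical helper; the cap
    max_len exists only so that the enumeration terminates)."""
    seen = {seed}
    frontier = [seed]
    while frontier:
        s = frontier.pop()
        for k in range(1, max_dup_len + 1):
            if len(s) + k > max_len:
                break  # longer duplications would also exceed the cap
            for i in range(len(s) - k + 1):
                t = tandem_duplicate(s, i, k)
                if t not in seen:
                    seen.add(t)
                    frontier.append(t)
    return seen
```

For instance, `tandem_duplicate("AGTCTGC", 4, 2)` copies the block `TG` and yields `AGTCTGTGC`. Counting `descendants(seed, k, n)` for growing `n` gives a rough, brute-force feel for how quickly the set of reachable strings grows, which is the quantity that capacity measures asymptotically.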
