The Bulk and The Tail of Minimal Absent Words in Genome Sequences
Erik Aurell, Nicolas Innocenti, Hai-Jun Zhou

TL;DR
This paper investigates the length distribution of minimal absent words (MAWs) in genomes. It shows that the bulk of short MAWs is explained by a statistical model of random sequences, while the tail of long MAWs reflects biological mechanisms, which suggests that long MAWs could serve as genomic markers.
Contribution
The paper introduces a probabilistic model for the bulk of MAWs and the concept of MAW cores. It links long MAWs to conserved genomic regions and MAW cores to untranslated regions (UTRs), which points to their biological significance.
Findings
Bulk MAWs are explained by a random sequence model.
Long MAWs are associated with conserved regions like rRNAs.
MAW cores are enriched in UTRs, indicating evolutionary relevance.
Abstract
Minimal absent words (MAWs) of a genomic sequence are subsequences that are themselves absent but whose subwords are all present in the sequence. The characteristic distribution of genomic MAWs as a function of their length has been observed to be qualitatively similar for all living organisms, the bulk being rather short, and only relatively few being long. It has been an open issue whether the reason behind this phenomenon is statistical or reflects a biological mechanism, and what biological information is contained in absent words. In this work we demonstrate that the bulk can be described by a probabilistic model of sampling words from random sequences, while the tail of long MAWs is of biological origin. We introduce the novel concept of the core of a minimal absent word: the sequences present in the genome that are closest to a given MAW. We show that in bacteria and yeast…
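The definition of a minimal absent word in the abstract can be made concrete with a short brute-force sketch. This is an illustration only, not the authors' algorithm: the function name `minimal_absent_words`, the `max_len` cutoff, and the choice to scan a single strand and ignore words of length 1 are all assumptions made for this example. It relies on the fact that a word is a MAW exactly when it is absent while both its longest proper prefix and its longest proper suffix are present, since every shorter subword lies inside one of those two.

```python
def minimal_absent_words(seq, max_len, alphabet="ACGT"):
    """Brute-force MAWs of length 2..max_len on a single strand (illustrative sketch)."""
    # Every substring of seq with length 1..max_len.
    present = {
        seq[i:i + k]
        for k in range(1, max_len + 1)
        for i in range(len(seq) - k + 1)
    }
    maws = set()
    for prefix in present:
        if len(prefix) >= max_len:
            continue  # extending this word would exceed max_len
        for letter in alphabet:
            word = prefix + letter
            # The longest proper prefix is present by construction, so the word
            # is a MAW if it is absent and its longest proper suffix is present.
            if word not in present and word[1:] in present:
                maws.add(word)
    return maws


# Tiny example over a two-letter alphabet.
print(sorted(minimal_absent_words("AAC", 3, alphabet="AC")))
# prints ['AAA', 'CA', 'CC']
```

Storing all substrings takes memory proportional to the sequence length times `max_len`, so this only suits short sequences; genome-scale work would need an index structure such as a suffix array instead.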
