Range-Limited Heaps' Law for Functional DNA Words in the Human Genome
Wentian Li, Yannis Almirantis, Astero Provata

TL;DR
This study demonstrates the existence of a range-limited Heaps' law in the human genome and other animal genomes using a specific DNA word definition related to protein-coding regions, revealing insights into genomic redundancy and diversity.
Contribution
It establishes the presence of Heaps' law in genomes with a novel DNA word definition and analyzes its properties across multiple species.
Findings
Heaps' law exists in the human genome within a limited range.
Range-limited Heaps' law is observed in several animal genomes with different exponents.
Deviations from the power law occur at the largest sample sizes, but quadratic fits in log-log plots remain accurate.
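
To make the notion of a Heaps' law exponent concrete, here is a minimal sketch assuming the usual power-law form V = K * N^beta, where N is the token count and V is the number of distinct types. Taking logarithms gives a straight line, log V = log K + beta * log N, so beta can be estimated by ordinary least squares in log-log space. The function name `fit_heaps` and the data are hypothetical and synthetic; they are not taken from the paper, and this is not necessarily the authors' fitting procedure.

```python
import math

def fit_heaps(tokens, types):
    """Least-squares fit of log(types) = log(K) + beta * log(tokens).

    Returns (K, beta) for the assumed power law types = K * tokens**beta.
    """
    xs = [math.log(n) for n in tokens]
    ys = [math.log(v) for v in types]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope of the regression line in log-log space is the exponent beta.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    beta = sxy / sxx
    # The intercept is log(K), so exponentiate to recover K.
    K = math.exp(mean_y - beta * mean_x)
    return K, beta

# Synthetic example (an assumption for illustration): an exact power law
# with K = 5 and beta = 0.6, sampled at five token counts.
tokens = [100, 200, 400, 800, 1600]
types = [5.0 * n ** 0.6 for n in tokens]

K, beta = fit_heaps(tokens, types)
print(f"K = {K:.3f}, beta = {beta:.3f}")
```

With real data, each (tokens, types) pair would correspond to one sample, such as a chromosome or chromosome arm. A range-limited law would show up as a good linear fit over only part of the range, which is one reason a quadratic term in log-log space can capture the curvature at the extremes.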
Abstract
Heaps' (or Herdan's) law is a linguistic law stating that the vocabulary/dictionary size (types) is a power-law function of the word count (tokens). Its existence in genomes under a given definition of DNA words is unclear, partly because the dictionary size in a genome could be much smaller than that in a human language. We define a DNA word as a coding region in a genome that codes for a protein domain. Using human chromosomes and chromosome arms as individual samples, we establish the existence of Heaps' law in the human genome within a limited range. Our definition of words in a genomic or proteomic context is different from other definitions, such as over-represented k-mers, which are much shorter in length. Although an approximate power-law distribution of protein domain sizes due to gene duplication and the related Zipf's law is well known, their translation to the…
