Genome Sizes and the Benford Distribution
James L. Friar, Terrance Goldman, Juan Pérez-Mercader

TL;DR
This paper examines how the number of Open Reading Frames (ORFs) scales with total genome size across the three domains of life. It finds that Eukaryotic ORF counts follow a Benford distribution, and uses the fits to estimate limits on genome size.
Contribution
It predicts that the number of ORFs in Eukaryotic genomes follows a Benford distribution with a specific logarithmic form, validates this against data for more than 1000 genomes, and links genome features to entropy and information theory.
Findings
The number of ORFs in Eukaryotic genomes grows logarithmically with genome size and fits a Benford distribution over several orders of magnitude.
The number of ORFs in Prokaryotic genomes grows linearly with genome size, fitting the linear regime of the model.
Estimated maximal Prokaryote genome size is around 8-12 megabasepairs.
Abstract
Data on the number of Open Reading Frames (ORFs) coded by genomes from the 3 domains of Life show some notable general features including essential differences between the Prokaryotes and Eukaryotes, with the number of ORFs growing linearly with total genome size for the former, but only logarithmically for the latter. Assuming that the (protein) coding and non-coding fractions of the genome must have different dynamics and that the non-coding fraction must be controlled by a variety of (unspecified) probability distribution functions, we are able to predict that the number of ORFs for Eukaryotes follows a Benford distribution and has a specific logarithmic form. Using the data for 1000+ genomes available to us in early 2010, we find excellent fits to the data over several orders of magnitude, in the linear regime for the Prokaryote data, and the full non-linear form for the Eukaryote…
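For readers unfamiliar with the term, the classical leading-digit form of Benford's law can be sketched in a few lines of Python. This is only an illustration of the general law, P(d) = log10(1 + 1/d) for leading digit d, and not the specific logarithmic form fitted in the paper, which the truncated abstract above does not give. The function names are hypothetical, and the test sequence (powers of 2) is a standard textbook example, not genome data.

```python
import math
from collections import Counter


def benford_probabilities():
    """Expected leading-digit probabilities under Benford's law."""
    # P(d) = log10(1 + 1/d) for d = 1..9; P(1) = log10(2), about 0.301.
    return {d: math.log10(1 + 1 / d) for d in range(1, 10)}


def leading_digit(n):
    """Return the first significant decimal digit of a positive integer."""
    if n <= 0:
        raise ValueError("n must be a positive integer")
    return int(str(n)[0])


def leading_digit_frequencies(values):
    """Observed fraction of values starting with each digit 1..9."""
    counts = Counter(leading_digit(v) for v in values)
    total = sum(counts.values())
    return {d: counts.get(d, 0) / total for d in range(1, 10)}


if __name__ == "__main__":
    expected = benford_probabilities()
    # Illustrative sequence only (NOT genome data): powers of 2 are a
    # well-known example of a sequence whose leading digits follow Benford.
    observed = leading_digit_frequencies([2**k for k in range(1, 201)])
    for d in range(1, 10):
        print(f"digit {d}: expected {expected[d]:.3f}, observed {observed[d]:.3f}")
```

To apply the same check to real data, one would pass a list of positive integer counts (for example ORF counts per genome, if available) to `leading_digit_frequencies` and compare the result with `benford_probabilities()`.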
