Information Theory and the Length Distribution of all Discrete Systems
Les Hatton, Gregory Warr

TL;DR
This paper shows that discrete systems as different as proteins and computer functions share qualitatively identical length distributions. It explains this with an information-theoretic framework based on the Conservation of Hartley-Shannon information (CoHSI), which predicts universal patterns such as power laws and Zipf's law.
Contribution
It introduces a theoretical framework that embeds CoHSI in classical statistical mechanics, unifies the length distributions of heterogeneous and homogeneous discrete systems, and validates its predictions across multiple domains.
Findings
Length distributions of proteins and computer functions are qualitatively identical.
The CoHSI framework predicts unimodal and power-law distributions for different system types.
Different but related alphabets in heterogeneous systems follow power-law relations.
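The quantities these findings refer to can be illustrated with a minimal sketch. It assumes each component (a protein, a function) is a sequence of tokens, and it uses Hartley's classical measure, the logarithm of the number of possible messages of a given length over a given alphabet. Whether the paper uses exactly this form for each system type is not stated in this summary, so the function names and the toy data below are hypothetical illustrations, not the authors' method.

```python
import math
from collections import Counter


def length_distribution(components):
    """Map each component length (in tokens) to the number of components of that length."""
    return Counter(len(c) for c in components)


def alphabet_size(component):
    """Count the distinct tokens used by a single component."""
    return len(set(component))


def hartley_information(length, alphabet):
    """Hartley's classical measure: log(alphabet ** length) = length * log(alphabet).

    Assumption: natural logarithm; the base only changes the unit.
    """
    return length * math.log(alphabet)


# Toy data (hypothetical): four short "proteins" written as amino-acid letters.
components = ["MKV", "MKVL", "AAA", "MKLV"]

dist = length_distribution(components)
for c in components:
    a = alphabet_size(c)
    print(c, len(c), a, hartley_information(len(c), a))
```

On real data, `dist` would be the histogram whose shape the findings describe, with length measured in amino acids for proteins or in language tokens for computer functions.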
Abstract
We begin with the extraordinary observation that the length distribution of 80 million proteins in UniProt, the Universal Protein Resource, measured in amino acids, is qualitatively identical to the length distribution of large collections of computer functions measured in programming language tokens, at all scales. That two such disparate discrete systems share important structural properties suggests that yet other apparently unrelated discrete systems might share the same properties, and certainly invites an explanation. We demonstrate that this is inevitable for all discrete systems of components built from tokens or symbols. Departing from existing work by embedding the Conservation of Hartley-Shannon information (CoHSI) in a classical statistical mechanics framework, we identify two kinds of discrete system, heterogeneous and homogeneous. Heterogeneous systems contain components…
Taxonomy
Topics: Statistical Mechanics and Entropy · Fractal and DNA sequence analysis · Protein Structure and Dynamics
