Entropy-Rank Ratio: A Novel Entropy-Based Perspective for DNA Complexity and Classification
Emmanuel Pio Pastore, Giuseppe Passarino, Peppino Sapia, Francesco De Rango

TL;DR
This paper introduces the entropy rank ratio, a new normalized measure for DNA sequence complexity that overcomes Shannon entropy saturation issues and improves classification accuracy when integrated into neural network data augmentation.
Contribution
The paper presents the entropy rank ratio, a novel distribution-aware entropy measure for DNA sequences, and demonstrates its effectiveness in enhancing neural network classification performance.
Findings
The entropy rank ratio R provides a normalized complexity measure in [0, 1] that avoids the saturation of Shannon entropy.
Using R for data augmentation improves neural network classification accuracy.
Benchmark results show significant gains with lightweight models.
Abstract
Shannon entropy is widely used to measure the complexity of DNA sequences but suffers from saturation effects that limit its discriminative power for long uniform segments. We introduce a novel metric, the entropy rank ratio R, which positions a target sequence within the full distribution of all possible sequences of the same length by computing the proportion of sequences that have an entropy value equal to or lower than that of the target. In other words, R expresses the relative position of a sequence within the global entropy spectrum, assigning values close to 0 for highly ordered sequences and close to 1 for highly disordered ones. DNA sequences are partitioned into fixed-length subsequences and non-overlapping n-mer groups; frequency vectors become ordered integer partitions and a combinatorial framework is used to derive the complete entropy distribution. Unlike classical…
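The definition of R above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes single-nucleotide frequencies over the alphabet ACGT, weights every possible sequence of the same length equally, and ignores the subsequence and n-mer grouping that the abstract mentions. All function names are hypothetical. Instead of enumerating all 4^L sequences, it enumerates integer partitions of the length L and counts how many sequences share each partition.

```python
from collections import Counter
from math import factorial, log2, prod


def partitions(total, max_parts, max_value=None):
    """Yield partitions of `total` into at most `max_parts` non-increasing positive parts."""
    if max_value is None:
        max_value = total
    if total == 0:
        yield ()
        return
    if max_parts == 0:
        return
    for first in range(min(total, max_value), 0, -1):
        for rest in partitions(total - first, max_parts - 1, first):
            yield (first,) + rest


def entropy(counts):
    """Shannon entropy (in bits) of a vector of symbol counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c > 0)


def sequences_for_partition(parts, alphabet_size):
    """Number of sequences whose sorted symbol counts equal `parts`."""
    length = sum(parts)
    # Orderings of the symbols for one fixed count vector.
    multinomial = factorial(length) // prod(factorial(c) for c in parts)
    # Ways to assign the parts (padded with zeros) to alphabet symbols.
    multiplicities = Counter(parts)
    multiplicities[0] = alphabet_size - len(parts)
    assignments = factorial(alphabet_size) // prod(
        factorial(m) for m in multiplicities.values()
    )
    return multinomial * assignments


def entropy_rank_ratio(sequence, alphabet="ACGT", tol=1e-12):
    """Fraction of all same-length sequences with entropy <= that of `sequence`."""
    counts = [sequence.count(symbol) for symbol in alphabet]
    if not sequence or sum(counts) != len(sequence):
        raise ValueError("sequence must be non-empty and use only alphabet symbols")
    target = entropy(counts)
    k, length = len(alphabet), len(sequence)
    at_or_below = sum(
        sequences_for_partition(p, k)
        for p in partitions(length, k)
        if entropy(p) <= target + tol  # tolerance guards against float round-off
    )
    return at_or_below / k**length


print(entropy_rank_ratio("AAAA"))  # 0.015625 (4 of 256 sequences)
print(entropy_rank_ratio("AACC"))  # 0.34375 (88 of 256 sequences)
print(entropy_rank_ratio("ACGT"))  # 1.0
```

For length 4, the five partitions (4), (3,1), (2,2), (2,1,1), and (1,1,1,1) account for 4, 48, 36, 144, and 24 sequences respectively, which sum to 4^4 = 256. The homopolymer AAAA therefore lands near 0 and the fully mixed ACGT at exactly 1, matching the ordering described in the abstract.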
Taxonomy
Topics: Fractal and DNA sequence analysis · Genomics and Chromatin Dynamics · RNA and protein synthesis mechanisms
