SENECA: Small-Sample Discrete Entropy Estimation via Self-Consistent Missing Mass
Lucas H. McCabe, H. Howie Huang

TL;DR
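SENECA is a small-sample discrete entropy estimator built on a novel self-consistent missing mass calculation. In numerical experiments it outperforms many state-of-the-art alternatives overall, and it is applied to biodiversity estimation and to detecting incorrect large language model responses.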
Contribution
The paper presents SENECA, a new entropy estimator that accounts for the probability mass of unobserved support elements (the "missing mass") through a self-consistent calculation, improving accuracy in small samples.
Findings
In extensive numerical experiments, SENECA outperforms many state-of-the-art estimators overall in the small-sample setting.
Applied to biodiversity estimation and the detection of incorrect large language model responses, SENECA shows competitive performance.
The method serves as a versatile drop-in replacement for small-sample entropy estimation.
Abstract
Discrete entropy estimation is a classic information theory problem, wherein the average information content of a discrete random variable is estimated from samples alone. Naive approaches, such as the plugin method, fail to account for the probability mass associated with members of the random variable's support that are unobserved in a given sample, known as the "missing mass." The resulting systematic underestimation is particularly problematic when data is time-consuming or costly to gather. We propose SENECA, an entropy estimation scheme based on a novel "self-consistent" missing mass calculation. Extensive numerical experiments indicate that our approach outperforms many state-of-the-art alternatives overall in the small-sample setting. We then apply SENECA to two practical use cases, namely biodiversity estimation and the detection of incorrect large language model responses,…
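To make the underestimation problem concrete, here is a minimal Python sketch of the plugin estimator the abstract mentions, alongside the classical Good-Turing missing mass estimate. This is not SENECA's self-consistent calculation, which this summary does not describe; Good-Turing is used here only as a familiar stand-in for "estimating the missing mass." The function names are illustrative assumptions, not names from the paper.

```python
import math
from collections import Counter


def plugin_entropy(samples):
    """Plugin (empirical-frequency) entropy estimate, in nats."""
    samples = list(samples)
    n = len(samples)
    counts = Counter(samples)
    # Only observed symbols contribute; unobserved support is ignored entirely.
    return -sum((c / n) * math.log(c / n) for c in counts.values())


def good_turing_missing_mass(samples):
    """Classical Good-Turing estimate: fraction of samples seen exactly once.

    Assumption: a stand-in illustration, not the paper's self-consistent method.
    """
    samples = list(samples)
    n = len(samples)
    counts = Counter(samples)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / n


def entropy(probs):
    """Exact Shannon entropy of a known distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)
```

With n samples the plugin estimate can never exceed log(n), so for a uniform distribution over 100 symbols (true entropy log(100)) any sample of 20 draws must underestimate. In the extreme case where all 20 draws are distinct, `plugin_entropy` returns log(20) and `good_turing_missing_mass` returns 1.0, signalling that most of the support is likely still unseen.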
