Missing $g$-mass: Investigating the Missing Parts of Distributions
Prafulla Chandra, Andrew Thangaraj

TL;DR
This paper introduces the concept of missing g-mass to analyze unobserved parts of large distributions, providing new estimation techniques and concentration bounds for various functions of the missing distribution.
Contribution
The paper defines the missing $g$-mass, studies minimax estimation of the order-$\alpha$ missing mass, and develops new concentration bounds, including strongly sub-Gamma and filtered sub-Gaussian types.
Findings
Exact minimax convergence rates for order-alpha missing mass.
Sub-Gaussian tail bounds with near-optimal variance factors.
Introduction of strongly sub-Gamma and filtered sub-Gaussian concentration notions.
Abstract
Estimating the underlying distribution from \textit{iid} samples is a classical and important problem in statistics. When the alphabet size is large compared to the number of samples, a portion of the distribution is highly likely to be unobserved or sparsely observed. The missing mass, defined as the sum of probabilities $\Pr(x)$ over the missing letters $x$, and the Good-Turing estimator for missing mass have been important tools in large-alphabet distribution estimation. In this article, given a positive function $g$ from $[0,1]$ to the reals, the missing $g$-mass, defined as the sum of $g(\Pr(x))$ over the missing letters $x$, is introduced and studied. The missing $g$-mass can be used to investigate the structure of the missing part of the distribution. Specific applications for special cases such as order-$\alpha$ missing mass ($g(p)=p^{\alpha}$) and the missing Shannon…
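The definitions in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the function names, the toy distribution, and the sample are hypothetical. The choice $g(p)=p^{\alpha}$ for the order-$\alpha$ case is assumed from the term's name, and the Good-Turing estimate is taken in its standard form (number of letters seen exactly once, divided by the sample size).

```python
from collections import Counter


def missing_g_mass(probs, sample, g):
    """Sum of g(Pr(x)) over letters x that never appear in the sample.

    Needs the true distribution, so it is an oracle quantity that is
    useful for simulation, not something computable from data alone.
    """
    seen = set(sample)
    return sum(g(p) for x, p in probs.items() if x not in seen)


def good_turing_missing_mass(sample):
    """Standard Good-Turing estimate of the (ordinary) missing mass:
    the number of letters observed exactly once, divided by n."""
    counts = Counter(sample)
    singletons = sum(1 for c in counts.values() if c == 1)
    return singletons / len(sample)


# Hypothetical toy distribution and sample (illustration only).
probs = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}
sample = ["a", "a", "b", "a"]  # letters "c" and "d" are missing

# g(p) = p recovers the ordinary missing mass.
m1 = missing_g_mass(probs, sample, lambda p: p)

# Assumed form of the order-alpha missing mass: g(p) = p ** alpha.
alpha = 2
m_alpha = missing_g_mass(probs, sample, lambda p: p ** alpha)

# Data-only estimate of the ordinary missing mass.
gt = good_turing_missing_mass(sample)
```

With $g(p)=p$ the function returns the total probability of the unseen letters. Other choices of $g$ weight those unseen probabilities differently, which is how the missing $g$-mass probes the structure of the unobserved part of the distribution.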
Taxonomy
Topics: Bayesian Methods and Mixture Models · Statistical Mechanics and Entropy · Machine Learning and Algorithms
