# Consistent estimation of the missing mass for feature models

**Authors:** Fadhel Ayed, Marco Battiston, Federico Camerlenghi, Stefano Favaro

arXiv: 1902.10530 · 2019-02-28

## TL;DR

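This paper studies estimation of the missing mass in feature models, that is, the conditional expected number of hitherto unseen features that a future observation will display. It proves that no universally consistent estimator exists under a suitable multiplicative loss, and shows that the nonparametric estimator of Ayed et al. (2017) is strongly consistent for a class of heavy-tailed feature probabilities.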

## Contribution

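The paper establishes that universally consistent estimation of the missing mass is impossible under a suitable multiplicative loss when no assumptions are placed on the probabilities $p_{j}$, and shows that the estimator of Ayed et al. (2017) is strongly consistent within a restricted class of heavy-tailed feature probabilities.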

## Key findings

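- No universally consistent estimator of the missing mass exists under a suitable multiplicative loss when no assumptions are imposed on the probabilities $p_{j}$.
- The nonparametric estimator of Ayed et al. (2017) is strongly consistent within a class of heavy-tailed probabilities $(p_{j})_{j\geq 1}$.
- As a byproduct, the paper derives concentration inequalities for the missing mass and for the number of features observed with a specified frequency in a sample of size $n$.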

## Abstract

Feature models are popular in machine learning and they have been recently used to solve many unsupervised learning problems. In these models every observation is endowed with a finite set of features, usually selected from an infinite collection $(F_{j})_{j\geq 1}$. Every observation can display feature $F_{j}$ with an unknown probability $p_{j}$. A statistical problem inherent to these models is how to estimate, given an initial sample, the conditional expected number of hitherto unseen features that will be displayed in a future observation. This problem is usually referred to as the missing mass problem. In this work we prove that, using a suitable multiplicative loss function and without imposing any assumptions on the parameters $p_{j}$, there does not exist any universally consistent estimator for the missing mass. In the second part of the paper, we focus on a special class of heavy-tailed probabilities $(p_{j})_{j\geq 1}$, which are common in many real applications, and we show that, within this restricted class of probabilities, the nonparametric estimator of the missing mass suggested by Ayed et al. (2017) is strongly consistent. As a byproduct result, we will derive concentration inequalities for the missing mass and the number of features observed with a specified frequency in a sample of size $n$.
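
To make the quantity in the abstract concrete, the following is a minimal simulation sketch, not the authors' code. It assumes that observations display each feature $F_{j}$ independently with probability $p_{j}$, truncates the infinite collection to a finite number of features, and uses a power-law choice $p_{j} = j^{-1.5}$ as a stand-in for a heavy-tailed case. The estimate computed here is the Good-Turing-style ratio of the number of features seen exactly once to the sample size $n$; treating this as the form of the Ayed et al. (2017) estimator is an assumption, since this summary does not state the formula. All function names are hypothetical.

```python
import random


def simulate_feature_sample(p, n, rng):
    """Draw n observations; entry [i][j] is True if observation i displays feature j.

    Assumption: features are displayed independently with probability p[j].
    """
    return [[rng.random() < pj for pj in p] for _ in range(n)]


def feature_counts(sample):
    """Number of observations in the sample that display each feature."""
    return [sum(column) for column in zip(*sample)]


def missing_mass(p, sample):
    """Sum of p[j] over features never displayed in the sample.

    This is the conditional expected number of hitherto unseen features
    displayed by one further observation, given the sample.
    """
    counts = feature_counts(sample)
    return sum(pj for pj, c in zip(p, counts) if c == 0)


def singleton_estimate(sample):
    """Good-Turing-style estimate: features seen exactly once, divided by n.

    Assumption: this form is used for illustration; it is not taken from
    the summary above.
    """
    counts = feature_counts(sample)
    return sum(1 for c in counts if c == 1) / len(sample)


rng = random.Random(0)
# Hypothetical truncated power-law probabilities standing in for a heavy tail.
p = [j ** -1.5 for j in range(1, 2001)]
n = 200
sample = simulate_feature_sample(p, n, rng)

print("true missing mass:  ", missing_mass(p, sample))
print("singleton estimate: ", singleton_estimate(sample))
```

The true missing mass is only computable here because the simulation knows every $p_{j}$; in practice only the sample is available, which is why an estimator based on observed frequency counts is needed.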

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1902.10530/full.md

## References

33 references — full list in the complete paper: https://tomesphere.com/paper/1902.10530/full.md

---
Source: https://tomesphere.com/paper/1902.10530