# Convergence of Smoothed Empirical Measures with Applications to Entropy Estimation

**Authors:** Ziv Goldfeld, Kristjan Greenewald, Yury Polyanskiy, Jonathan Weed

arXiv: 1905.13576 · 2020-05-04

## TL;DR

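This paper analyzes how Gaussian-smoothed empirical measures $\hat{P}_n\ast\mathcal{N}_\sigma$ converge to $P\ast\mathcal{N}_\sigma$ under the Wasserstein, TV, KL, and $\chi^2$ distances. The rates avoid the $n^{-\frac{1}{d}}$ slowdown typical of unsmoothed $\mathsf{W}_1$, and the results are applied to high-dimensional differential entropy estimation, where the plug-in estimator is shown to be minimax rate-optimal.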

## Contribution

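The paper establishes convergence rates for Gaussian-smoothed empirical measures under the Wasserstein distance, TV, KL divergence, and $\chi^2$-divergence, and uses them to show that the plug-in estimator of $h(P\ast\mathcal{N}_\sigma)$ is minimax rate-optimal in the high-dimensional regime.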

## Key findings

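- Under TV and $\mathsf{W}_1$, the smoothed approximation error converges at rate $e^{O(d)}n^{-\frac{1}{2}}$, in contrast to the typical $n^{-\frac{1}{d}}$ rate for unsmoothed $\mathsf{W}_1$ (for $d\ge 3$).
- Under KL divergence, $\mathsf{W}_2^2$, and $\chi^2$-divergence, the rate is $e^{O(d)}n^{-1}$, but only if $P$ has finite input-output $\chi^2$ mutual information across the additive white Gaussian noise channel; otherwise the KL and $\mathsf{W}_2^2$ rates become $\omega(n^{-1})$ and the $\chi^2$-divergence is infinite.
- Any good estimator of $h(P\ast\mathcal{N}_\sigma)$ needs sample complexity exponential in $d$; the absolute-error risk of the plug-in estimator converges at the parametric rate $e^{O(d)}n^{-\frac{1}{2}}$, which makes it minimax rate-optimal.
- Numerical results show the plug-in estimator significantly outperforms general-purpose differential entropy estimators.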
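
To put the contrast between the smoothed rate $n^{-\frac{1}{2}}$ and the unsmoothed $\mathsf{W}_1$ rate $n^{-\frac{1}{d}}$ in numbers, the short sketch below evaluates only the $n$-dependent factors. It deliberately ignores the $e^{O(d)}$ prefactor and every other constant, so the values are not error bounds; the function name `rate_factors` is a hypothetical helper introduced here for illustration.

```python
def rate_factors(n, d):
    """Return the n-dependent factors (smoothed, unsmoothed) of the two rates.

    Assumption: the e^{O(d)} prefactor and all constants are dropped, so this
    only illustrates how each rate scales with n, not the actual error.
    """
    smoothed = n ** -0.5          # n^(-1/2): Gaussian-smoothed TV / W1 rate
    unsmoothed = n ** (-1.0 / d)  # n^(-1/d): typical unsmoothed W1 rate, d >= 3
    return smoothed, unsmoothed


# Compare the two factors at a fixed sample size for a few dimensions.
for d in (3, 5, 10):
    s, u = rate_factors(10**6, d)
    print(f"d={d:2d}  n^(-1/2)={s:.1e}  n^(-1/d)={u:.1e}")
```

The $n^{-\frac{1}{2}}$ factor does not depend on $d$, while $n^{-\frac{1}{d}}$ decays more slowly as $d$ grows; in the smoothed case the dimension dependence sits in the $e^{O(d)}$ prefactor that this sketch omits.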

## Abstract

This paper studies convergence of empirical measures smoothed by a Gaussian kernel. Specifically, consider approximating $P\ast\mathcal{N}_\sigma$, for $\mathcal{N}_\sigma\triangleq\mathcal{N}(0,\sigma^2 \mathrm{I}_d)$, by $\hat{P}_n\ast\mathcal{N}_\sigma$, where $\hat{P}_n$ is the empirical measure, under different statistical distances. The convergence is examined in terms of the Wasserstein distance, total variation (TV), Kullback-Leibler (KL) divergence, and $\chi^2$-divergence. We show that the approximation error under the TV distance and 1-Wasserstein distance ($\mathsf{W}_1$) converges at rate $e^{O(d)}n^{-\frac{1}{2}}$ in remarkable contrast to a typical $n^{-\frac{1}{d}}$ rate for unsmoothed $\mathsf{W}_1$ (and $d\ge 3$). For the KL divergence, squared 2-Wasserstein distance ($\mathsf{W}_2^2$), and $\chi^2$-divergence, the convergence rate is $e^{O(d)}n^{-1}$, but only if $P$ achieves finite input-output $\chi^2$ mutual information across the additive white Gaussian noise channel. If the latter condition is not met, the rate changes to $\omega(n^{-1})$ for the KL divergence and $\mathsf{W}_2^2$, while the $\chi^2$-divergence becomes infinite, a curious dichotomy. As a main application we consider estimating the differential entropy $h(P\ast\mathcal{N}_\sigma)$ in the high-dimensional regime. The distribution $P$ is unknown but $n$ i.i.d. samples from it are available. We first show that any good estimator of $h(P\ast\mathcal{N}_\sigma)$ must have sample complexity that is exponential in $d$. Using the empirical approximation results we then show that the absolute-error risk of the plug-in estimator converges at the parametric rate $e^{O(d)}n^{-\frac{1}{2}}$, thus establishing the minimax rate-optimality of the plug-in. Numerical results that demonstrate a significant empirical superiority of the plug-in approach to general-purpose differential entropy estimators are provided.
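
As a rough illustration of the plug-in estimator $h(\hat{P}_n\ast\mathcal{N}_\sigma)$ described above: $\hat{P}_n\ast\mathcal{N}_\sigma$ is an equally weighted Gaussian mixture centered at the $n$ samples, so its differential entropy can be approximated by Monte Carlo as $-\frac{1}{m}\sum_j \log q(Y_j)$ with $Y_j$ drawn from the mixture density $q$. The sketch below is one minimal way to do this using only the Python standard library; it is not taken from the paper, and the names `plugin_entropy_mc`, `n_mc`, and `seed` are assumptions made for this example.

```python
import math
import random


def plugin_entropy_mc(samples, sigma, n_mc=2000, seed=0):
    """Monte Carlo estimate (in nats) of h(P_hat_n * N_sigma).

    samples: list of d-dimensional points (lists of floats) drawn from P.
    sigma:   standard deviation of the isotropic Gaussian smoothing.
    n_mc:    number of Monte Carlo draws (hypothetical default).
    """
    rng = random.Random(seed)
    n = len(samples)
    d = len(samples[0])
    # Log of the Gaussian normalizing constant (2*pi*sigma^2)^(-d/2).
    log_norm = -0.5 * d * math.log(2.0 * math.pi * sigma ** 2)
    total_log_q = 0.0
    for _ in range(n_mc):
        # Draw Y from the mixture: pick a sample uniformly, add Gaussian noise.
        center = samples[rng.randrange(n)]
        y = [c + sigma * rng.gauss(0.0, 1.0) for c in center]
        # Exponent of each mixture component evaluated at Y.
        expo = [
            -sum((yi - xi) ** 2 for yi, xi in zip(y, x)) / (2.0 * sigma ** 2)
            for x in samples
        ]
        # Numerically stable log-sum-exp over the n components.
        m = max(expo)
        log_q = m + math.log(sum(math.exp(e - m) for e in expo))
        total_log_q += log_q - math.log(n) + log_norm
    # Differential entropy is minus the expected log-density.
    return -total_log_q / n_mc


# Usage: 200 samples from a standard Gaussian in d = 3, smoothed with sigma = 1.
data_rng = random.Random(1)
data = [[data_rng.gauss(0.0, 1.0) for _ in range(3)] for _ in range(200)]
print(plugin_entropy_mc(data, sigma=1.0, n_mc=500))
```

Each draw costs $O(nd)$ operations, so this loop-based version is only practical for small $n$; a vectorized implementation would be the natural choice at scale. The Monte Carlo error of the integration is separate from the statistical error of the plug-in estimator that the abstract's $e^{O(d)}n^{-\frac{1}{2}}$ rate refers to.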

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1905.13576/full.md

## Figures

10 figures with captions in the complete paper: https://tomesphere.com/paper/1905.13576/full.md

## References

45 references — full list in the complete paper: https://tomesphere.com/paper/1905.13576/full.md

---
Source: https://tomesphere.com/paper/1905.13576