# Ensemble Estimation of Generalized Mutual Information with Applications to Genomics

**Authors:** Kevin R. Moon, Kumar Sricharan, Alfred O. Hero III

arXiv: 1701.08083 · 2021-07-30

## TL;DR

This paper introduces GENIE, an ensemble estimator of generalized mutual information that achieves the parametric $1/N$ mean squared error rate for continuous and mixed discrete-continuous data, and applies it to gene relationships in single-cell data.

## Contribution

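The paper develops what the authors believe to be the first nonparametric mutual information estimator that attains the parametric rate for mixed discrete-continuous data. The estimator is a weighted sum of simple kernel density plug-in estimators, with weights obtained from an offline convex optimization problem.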

## Key findings

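- Achieves the $1/N$ parametric mean squared error convergence rate in both the continuous and mixed cases, provided the conditional densities of the continuous variables are sufficiently smooth.
- Demonstrates the mixed-case estimator on simulated data and applies it to the analysis of gene relationships in single-cell data.
- Provides theoretical guarantees: a central limit theorem for the ensemble estimators and minimax rates for the continuous case.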

## Abstract

Mutual information is a measure of the dependence between random variables that has been used successfully in myriad applications in many fields. Generalized mutual information measures that go beyond classical Shannon mutual information have also received much interest in these applications. We derive the mean squared error convergence rates of kernel density-based plug-in estimators of general mutual information measures between two multidimensional random variables $\mathbf{X}$ and $\mathbf{Y}$ for two cases: 1) $\mathbf{X}$ and $\mathbf{Y}$ are continuous; 2) $\mathbf{X}$ and $\mathbf{Y}$ may have any mixture of discrete and continuous components. Using the derived rates, we propose an ensemble estimator of these information measures called GENIE by taking a weighted sum of the plug-in estimators with varied bandwidths. The resulting ensemble estimators achieve the $1/N$ parametric mean squared error convergence rate when the conditional densities of the continuous variables are sufficiently smooth. To the best of our knowledge, this is the first nonparametric mutual information estimator known to achieve the parametric convergence rate for the mixture case, which frequently arises in applications (e.g. variable selection in classification). The estimator is simple to implement and it uses the solution to an offline convex optimization problem and simple plug-in estimators. A central limit theorem is also derived for the ensemble estimators and minimax rates are derived for the continuous case. We demonstrate the ensemble estimator for the mixed case on simulated data and apply the proposed estimator to analyze gene relationships in single cell data.
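The ensemble idea in the abstract (a weighted sum of kernel density plug-in estimators over several bandwidths, with weights computed offline) can be illustrated with a minimal Python sketch. This is not the authors' implementation, and several choices are assumptions made only for illustration: it handles Shannon mutual information for continuous variables only, uses a Gaussian product kernel, scales bandwidths as `l * n ** (-1 / (2 * d))`, and picks the minimum-norm weights that sum to one while cancelling hypothetical bias terms proportional to `l ** i`. The function names (`kde_at_samples`, `plugin_mi`, `ensemble_weights`, `ensemble_mi`) are likewise hypothetical; the exact constraint set and bandwidth scaling are given in the full paper.

```python
import numpy as np


def kde_at_samples(data, h):
    """Gaussian product-kernel density estimate evaluated at each sample."""
    n, d = data.shape
    sq_dist = np.sum((data[:, None, :] - data[None, :, :]) ** 2, axis=2)
    kernel = np.exp(-sq_dist / (2.0 * h ** 2)) / (2.0 * np.pi * h ** 2) ** (d / 2.0)
    return kernel.mean(axis=1)


def plugin_mi(x, y, h):
    """Plug-in estimate of Shannon MI: sample mean of log(f_xy / (f_x * f_y))."""
    f_x = kde_at_samples(x, h)
    f_y = kde_at_samples(y, h)
    f_xy = kde_at_samples(np.hstack([x, y]), h)
    return float(np.mean(np.log(f_xy / (f_x * f_y))))


def ensemble_weights(ls, n_bias_terms):
    """Minimum-norm weights with sum(w) = 1 and sum(w * l**i) = 0 for i = 1..J.

    Assumption: the bias terms to cancel are powers of l. For an
    underdetermined system, lstsq returns the minimum-norm solution.
    """
    A = np.vstack([ls ** i for i in range(n_bias_terms + 1)])
    b = np.zeros(n_bias_terms + 1)
    b[0] = 1.0
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w


def ensemble_mi(x, y, ls):
    """Weighted sum of plug-in estimates over the bandwidth parameters ls."""
    n = x.shape[0]
    d = x.shape[1] + y.shape[1]
    weights = ensemble_weights(ls, n_bias_terms=d)
    # Assumed bandwidth scaling: h(l) = l * n^(-1/(2d)).
    estimates = np.array([plugin_mi(x, y, l * n ** (-1.0 / (2 * d))) for l in ls])
    return float(weights @ estimates), weights


# Toy data: two correlated scalar Gaussian variables.
rng = np.random.default_rng(0)
x = rng.normal(size=(300, 1))
y = 0.8 * x + 0.6 * rng.normal(size=(300, 1))
ls = np.linspace(1.0, 3.0, 10)
mi_hat, weights = ensemble_mi(x, y, ls)
print(mi_hat)
```

The weights depend only on the bandwidth grid and the dimension, not on the data, which is why they can be computed offline and reused. Some weights are negative: the ensemble effectively extrapolates the plug-in estimates toward zero bandwidth bias rather than averaging them.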

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1701.08083/full.md

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/1701.08083/full.md

## References

128 references — full list in the complete paper: https://tomesphere.com/paper/1701.08083/full.md

---
Source: https://tomesphere.com/paper/1701.08083