From Adaptive Kernel Density Estimation to Sparse Mixture Models

Colas Schretter; Jianyong Sun; Peter Schelkens

arXiv:1812.04397·stat.ML·December 12, 2018

Open Access

TL;DR

This paper presents a semi-parametric method that transitions from adaptive kernel density estimation to sparse Gaussian mixture models, enabling low-complexity models that adaptively reduce components with increased smoothing.

Contribution

It introduces a balloon estimator within a generalized EM framework for automatic parameter estimation, bridging non-parametric KDE and parametric mixture models.

Findings

01 Sparse models retain detail of adaptive KDE

02 Model complexity decreases with higher smoothing

03 Method effectively estimates mixture parameters from limited data

Abstract

We introduce a balloon estimator in a generalized expectation-maximization method for estimating all parameters of a Gaussian mixture model given one data sample per mixture component. Instead of limiting explicitly the model size, this regularization strategy yields low-complexity sparse models where the number of effective mixture components reduces with an increase of a smoothing probability parameter $P > 0$. This semi-parametric method bridges from non-parametric adaptive kernel density estimation (KDE) to parametric ordinary least-squares when $P = 1$. Experiments show that simpler sparse mixture models retain the level of details present in the adaptive KDE solution.

Equations

$$f(x) = \sum_{m=1}^{M} \pi_{m}\, \mathcal{N}(x \mid \mu_{m}, \Sigma_{m}) \quad \text{with} \quad \sum_{m=1}^{M} \pi_{m} = 1,$$

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{2\pi\sqrt{|\Sigma|}} \exp\left[-\frac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right].$$

$$\mu_{m} = x_{m} \quad \text{and} \quad \pi_{m} = \frac{1}{N} \quad \text{for each } m \in [1,\dots,N].$$

$$K(r \mid x, R) = \exp\left[-\frac{1}{2}(r-x)^{\top}R^{-1}(r-x)\right],$$

$$\begin{aligned} P(x_{n} \mid S_{n}) &= \sum_{m=1}^{M} \pi_{m} \sqrt{\frac{|S_{n}|}{|\Sigma_{m}+S_{n}|}}\, K(x_{n} \mid \mu_{m}, \Sigma_{m}+S_{n}) \\ &= \sum_{m=1}^{M} P_{m}(x_{n} \mid S_{n}). \end{aligned}$$

$$R_{n} = \sum_{m=1}^{M} \frac{P_{m}(x_{n} \mid S_{n})}{P(x_{n} \mid S_{n})} \left[\Sigma_{m \mid S_{n}} + (x_{n}-\mu_{m \mid S_{n}})(x_{n}-\mu_{m \mid S_{n}})^{\top}\right]$$

$\Sigma_{m \mid S_{n}}$, $\mu_{m \mid S_{n}}$

$$\sigma_{n} = \operatorname*{arg\,min}_{\sigma}\; P(x_{n} \mid R_{n}) = P \quad \text{for each } n \in [1,\dots,N].$$

$$\sigma_{n}^{2} \leftarrow \frac{P}{P(x_{n} \mid R_{n})}\, \sigma_{n}^{2} \quad \text{for each } n \in [1,\dots,N].$$

$P_{m,n}$

$\pi_{m}$, $\Sigma_{m}$

$R_{n \mid m}$

Taxonomy

Topics: Gaussian Processes and Bayesian Inference · Bayesian Methods and Mixture Models · Target Tracking and Data Fusion in Sensor Networks

Full text

From Adaptive Kernel Density Estimation to Sparse Mixture Models

Colas Schretter$^{1,3}$, Jianyong Sun$^{2}$ and Peter Schelkens$^{1,3}$

$^{1}$ Vrije Universiteit Brussel (VUB), Dept. of Electronics and Informatics (ETRO), B-1050 Brussels, Belgium.

$^{2}$ School of Mathematics and Statistics, Xi’an Jiaotong University, Xi’an 710049, China.

$^{3}$ imec, Kapeldreef 75, B-3001 Leuven, Belgium.

This work received funding from the European Research Council under the EU FP7/2007-2013 / ERC Grant Agreement Nr. 617779 (INTERFERE).

1 Introduction

In bivariate adaptive kernel density estimation (KDE), we wish to estimate all individual variance-covariance matrices $\{\Sigma_{1},\dots,\Sigma_{N}\}\subset\mathbb{R}^{2\times 2}$ associated with $N$ input point samples $\{x_{1},\dots,x_{N}\}\subset\mathbb{R}^{2}$ such that the underlying continuous density function $f(x)$ is represented by a finite Gaussian mixture model (GMM) with $M=N$ components:

$$f(x) = \sum_{m=1}^{M} \pi_{m}\, \mathcal{N}(x \mid \mu_{m}, \Sigma_{m}) \quad \text{with} \quad \sum_{m=1}^{M} \pi_{m} = 1,$$

where $\mathcal{N}$ is the multivariate Gaussian probability density

$$\mathcal{N}(x \mid \mu, \Sigma) = \frac{1}{2\pi\sqrt{|\Sigma|}} \exp\left[-\frac{1}{2}(x-\mu)^{\top}\Sigma^{-1}(x-\mu)\right].$$
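As an illustration, the mixture density $f(x)$ and its bivariate Gaussian components could be evaluated as in the minimal NumPy sketch below. The function names `gaussian_pdf` and `gmm_density` are hypothetical and do not come from the paper; the normalization $1/(2\pi\sqrt{|\Sigma|})$ is the standard one for two dimensions.

```python
import numpy as np

def gaussian_pdf(x, mean, cov):
    """Bivariate Gaussian density N(x | mean, cov) at a single point x."""
    diff = np.asarray(x, dtype=float) - np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    # Solve cov @ y = diff instead of forming the inverse explicitly.
    mahalanobis_sq = diff @ np.linalg.solve(cov, diff)
    # Normalization for the two-dimensional case only.
    norm = 2.0 * np.pi * np.sqrt(np.linalg.det(cov))
    return np.exp(-0.5 * mahalanobis_sq) / norm

def gmm_density(x, weights, means, covs):
    """Mixture density f(x) = sum_m weights[m] * N(x | means[m], covs[m])."""
    return sum(w * gaussian_pdf(x, mu, cov)
               for w, mu, cov in zip(weights, means, covs))
```

With a single standard-normal component, `gmm_density` evaluated at the origin reduces to $1/(2\pi)$, which is a quick sanity check of the normalization.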

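Non-parametric estimation approaches like KDE constrain the means of the mixture components to the observed point samples and assume uniform prior probabilities such that

$$\mu_{m} = x_{m} \quad \text{and} \quad \pi_{m} = \frac{1}{N} \quad \text{for each } m \in [1,\dots,N].$$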

If the component means are allowed to shift, we enter the realm of semi-parametric density estimation: a highly underdetermined problem that is classically regularized by enforcing a modest model size $M\ll N$ and jointly estimating all parameters within the maximum-likelihood (ML) framework, which leads to simple update rules in an expectation-maximization (EM) algorithm.

Semi-parametric estimation faces two unsolved problems that are (lightly) avoided by the non-parametric framework: How to choose the model size? How to initialize the parameters of the mixture components? In answer, this work proposes to start from a full mixture model with $M=N$, initialized to the trivial ML solution where components fit data points with near-singular variance-covariance matrices. A regularization strategy uses a balloon estimator [2] on the current density estimate to convolve the data locally and adaptively. After convergence, a sparse lower-complexity mixture model emerges.
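The trivial starting point described above (one component per sample, uniform weights, near-singular covariances) might be set up as follows. This is only a sketch: the helper name `initial_mixture`, the variance floor `eps`, the unit square, and the uniform sampling are all assumptions made for illustration, since the text does not specify them.

```python
import numpy as np

def initial_mixture(samples, eps=1e-6):
    """One component per sample: mean at the sample, weight 1/N, tiny covariance.

    eps is a hypothetical variance floor standing in for "near-singular".
    """
    samples = np.asarray(samples, dtype=float)
    n, d = samples.shape
    weights = np.full(n, 1.0 / n)               # uniform prior probabilities
    means = samples.copy()                      # means pinned to the data
    covs = np.tile(eps * np.eye(d), (n, 1, 1))  # near-singular isotropic covariances
    return weights, means, covs

# Assumed setup: N = 64 points drawn uniformly in the unit square.
rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=(64, 2))
weights, means, covs = initial_mixture(samples)
```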

2 Method

The initial ML solution attributes a very small mixture component to each data sample and fails to generalize in regions where no sample is collected. Regularization is introduced to account for the free space in between samples by iteratively updating an isotropic balloon estimate per data element. A multivariate data-driven regularizing kernel is then computed analytically. Each iteration of a generalized EM algorithm jointly updates all parameters, given the sample positions and the estimated kernels representing their spatial "territory". This alternating optimization of data-space ballooning and solution-space EM converges to a stationary point. The only controllable smoothing parameter is the prior probability $P\in(0,1]$.

2.1 Balloon estimator

Given a density function $f$ and any position $x\in\mathbb{R}^{2}$, a numerical root-finding method can search for the unique variance $\sigma^{2}$ such that the integrated product of the density with a corresponding peak-normalized anisotropic multivariate kernel

$$K(r \mid x, R) = \exp\left[-\frac{1}{2}(r-x)^{\top}R^{-1}(r-x)\right],$$

is equal to the given regularization parameter $P$.

For each isotropic balloon of variance $S_{n}=\sigma_{n}^{2}I\in\mathbb{R}^{2\times 2}$ centered at $x_{n}$ , the kernel matrix $R_{n}$ is built by computing the ML fit of the product of the current density function with the local soft spatial "territory" covered by the balloon estimate $S_{n}$ .

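The integral of products can be evaluated in closed form since the density $f$ is expressed by a GMM and we have

$$\begin{aligned} P(x_{n} \mid S_{n}) &= \sum_{m=1}^{M} \pi_{m} \sqrt{\frac{|S_{n}|}{|\Sigma_{m}+S_{n}|}}\, K(x_{n} \mid \mu_{m}, \Sigma_{m}+S_{n}) \\ &= \sum_{m=1}^{M} P_{m}(x_{n} \mid S_{n}). \end{aligned}$$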
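A minimal sketch of this closed-form evaluation is given below. It assumes that each term is $\pi_{m}\sqrt{|S_{n}|/|\Sigma_{m}+S_{n}|}\,K(x_{n}\mid\mu_{m},\Sigma_{m}+S_{n})$, which is what the Gaussian convolution identity yields in two dimensions for a peak-normalized kernel. The function names `balloon_terms` and `balloon_probability` are hypothetical.

```python
import numpy as np

def balloon_terms(x, S, weights, means, covs):
    """Per-component terms P_m(x | S) of the integrated kernel-density product."""
    x = np.asarray(x, dtype=float)
    S = np.asarray(S, dtype=float)
    terms = []
    for w, mu, cov in zip(weights, means, covs):
        total = np.asarray(cov, dtype=float) + S   # covariance of the convolution
        diff = x - np.asarray(mu, dtype=float)
        kernel = np.exp(-0.5 * diff @ np.linalg.solve(total, diff))
        scale = np.sqrt(np.linalg.det(S) / np.linalg.det(total))
        terms.append(w * scale * kernel)
    return np.array(terms)

def balloon_probability(x, S, weights, means, covs):
    """P(x | S): probability mass of the mixture captured by the balloon at x."""
    return balloon_terms(x, S, weights, means, covs).sum()
```

For a single standard-normal component and a unit balloon $S=I$ centered on its mean, the result is $\sqrt{1/4}=0.5$, and the value tends to 1 as the balloon grows.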

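Since the density model is a GMM, we can compute analytically the integral of the product between the balloon kernel and each mixture component. Therefore, multivariate data-adaptive regularizing kernels have the following simple closed form:

$$R_{n} = \sum_{m=1}^{M} \frac{P_{m}(x_{n} \mid S_{n})}{P(x_{n} \mid S_{n})} \left[\Sigma_{m \mid S_{n}} + (x_{n}-\mu_{m \mid S_{n}})(x_{n}-\mu_{m \mid S_{n}})^{\top}\right]$$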

with the parameters of all products with the balloon kernel:

[TABLE]
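The expressions for $\Sigma_{m\mid S_{n}}$ and $\mu_{m\mid S_{n}}$ are not legible in this copy. The sketch below therefore assumes the standard product-of-Gaussians identities, $\Sigma_{m\mid S_{n}}=(\Sigma_{m}^{-1}+S_{n}^{-1})^{-1}$ and $\mu_{m\mid S_{n}}=\Sigma_{m\mid S_{n}}(\Sigma_{m}^{-1}\mu_{m}+S_{n}^{-1}x_{n})$, which may differ in form from what the authors wrote. It combines them into the regularizing kernel $R_{n}$ as the responsibility-weighted sum of second moments about $x_{n}$. All function names are hypothetical.

```python
import numpy as np

def product_parameters(x, S, mu, cov):
    """Mean and covariance of the product of N(mu, cov) with a kernel (x, S).

    Assumed: standard Gaussian product identities, not taken from the paper.
    """
    cov_inv = np.linalg.inv(cov)
    S_inv = np.linalg.inv(S)
    prod_cov = np.linalg.inv(cov_inv + S_inv)
    prod_mean = prod_cov @ (cov_inv @ mu + S_inv @ x)
    return prod_mean, prod_cov

def regularizing_kernel(x, S, weights, means, covs):
    """R_n: weighted second moments about x of the balloon-component products."""
    x = np.asarray(x, dtype=float)
    S = np.asarray(S, dtype=float)
    terms, moments = [], []
    for w, mu, cov in zip(weights, means, covs):
        mu = np.asarray(mu, dtype=float)
        cov = np.asarray(cov, dtype=float)
        total = cov + S
        diff = x - mu
        # Closed-form term P_m(x | S) for this component.
        scale = np.sqrt(np.linalg.det(S) / np.linalg.det(total))
        terms.append(w * scale * np.exp(-0.5 * diff @ np.linalg.solve(total, diff)))
        prod_mean, prod_cov = product_parameters(x, S, mu, cov)
        offset = x - prod_mean
        moments.append(prod_cov + np.outer(offset, offset))
    terms = np.array(terms)
    resp = terms / terms.sum()                   # P_m(x | S) / P(x | S)
    return np.einsum("m,mij->ij", resp, np.array(moments))
```

With one standard-normal component and a unit balloon placed at $x=(2,0)$, the product has mean $(1,0)$ and covariance $I/2$, so the resulting kernel is elongated along the direction toward the component.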

We apply a balloon estimator independently at every input data sample $x_{n}$, giving us the convex optimization problems

$$\sigma_{n} = \operatorname*{arg\,min}_{\sigma}\; P(x_{n} \mid R_{n}) = P \quad \text{for each } n \in [1,\dots,N].$$

In practice, only a few steps (or even a single step) of a fixed-point iteration may be used for updating $\sigma_{n}$ since the objective increases monotonically with the variance. This gives the following sequence of multiplicative updates starting with $\sigma_{n}\leftarrow 1$:

$$\sigma_{n}^{2} \leftarrow \frac{P}{P(x_{n} \mid R_{n})}\, \sigma_{n}^{2} \quad \text{for each } n \in [1,\dots,N].$$

We quit the loop whenever $(P(x_{n}|R_{n})-P)^{2}<(P\times 0.01)^{2}$.
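The multiplicative update and its stopping rule could be sketched as follows. As a simplification that is an assumption of this example, the probability is evaluated with the isotropic balloon $S_{n}=\sigma_{n}^{2}I$ rather than the adapted kernel $R_{n}$. The iteration cap `max_iter`, the underflow guard, and the function names are also hypothetical. A finite solution requires $P<1$.

```python
import numpy as np

def balloon_probability(x, S, weights, means, covs):
    """Closed-form integral of the GMM times a peak-normalized kernel (x, S)."""
    x = np.asarray(x, dtype=float)
    prob = 0.0
    for w, mu, cov in zip(weights, means, covs):
        total = np.asarray(cov, dtype=float) + S
        diff = x - np.asarray(mu, dtype=float)
        scale = np.sqrt(np.linalg.det(S) / np.linalg.det(total))
        prob += w * scale * np.exp(-0.5 * diff @ np.linalg.solve(total, diff))
    return prob

def fit_balloon_variance(x, P, weights, means, covs, rel_tol=0.01, max_iter=100):
    """Fixed-point search for sigma^2 such that the balloon probability equals P."""
    sigma_sq = 1.0                                   # start with sigma = 1
    for _ in range(max_iter):
        prob = balloon_probability(x, sigma_sq * np.eye(2), weights, means, covs)
        prob = max(prob, np.finfo(float).tiny)       # guard against underflow
        if (prob - P) ** 2 < (P * rel_tol) ** 2:     # within 1% of the target
            break
        sigma_sq *= P / prob                         # multiplicative update
    return sigma_sq
```

For a single standard-normal component probed at its mean, the balloon probability is $\sigma^{2}/(1+\sigma^{2})$, so a target of $P=0.25$ corresponds to $\sigma^{2}=1/3$ and the loop stops close to that value.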

This solution is a continuous variant of $K$-nearest neighbors (KNN), where the hard disc indicator function is replaced by a soft multivariate Gaussian kernel that steers to trends in the data and the count $K<N$ is replaced by the probability $P<1$.

2.2 Regularized expectation-maximization

The E-step computes a partition of unity for each "inflated" point sample $x_{n}$. Sample positions are exact and the regularized mixture model represents the continuous density, including an estimate of missing data samples. Thus, the E-step is simply

[TABLE]

The M-step updates the prior probabilities, means and matrices by using the regularizing matrices $R_{n}$ from the balloon estimator as a prior for the variance-covariance matrices:

[TABLE]

with the regularizing additive matrices

[TABLE]

3 Experiments

We drew $N=64$ random samples in a square and ran 1000 iterations, until almost sure convergence. Figure 1 shows two results with moderate and strong smoothing prior probabilities, comparing data overfitting with a fair adaptive smoothing. The adaptive KDE density simply uses the augmented data, with multivariate regularizing kernels at the sample positions. Oriented kernels with elongated elliptic footprints have smaller determinants than their corresponding balloons. Thus, they apply lighter constraints on parametric estimations.

Experiments demonstrate that even though the balloon estimator computes a single scalar variance per data sample, the adapted regularizing kernels steer to the shape of local data clusters. The EM approach yields far fewer representative mixture components without noticeable loss of quality. The solutions converged to only $45$ and $21$ effective components for $P=1/64$ and $P=1/32$, respectively. Results are comparable to model selection through variational Bayes inference [3], but without using the (a priori) conjugate priors assumption.

4 Conclusion

This work introduces a regularized expectation-maximization method for estimating mixture density models of arbitrary size given an incomplete set of point samples. A probability parameter drives the complexity of the solution from maximum likelihood with one component per sample to ordinary least squares with a single component. We estimate full-covariance kernels for anisotropic local regularization. Sparse GMMs subsume the data effectively and their densities are similar to adaptive KDE. Ongoing work tackles image approximation and restoration using such a regularized continuous model [4].

Bibliography

[1] Jianyong Sun and Ata Kabán, “A fast algorithm for robust mixtures in the presence of measurement errors,” IEEE Transactions on Neural Networks, vol. 21, no. 8, pp. 1206–1220, 2010.
[2] Stephan R. Sain, “Multivariate locally adaptive density estimation,” Computational Statistics & Data Analysis, vol. 39, no. 2, pp. 165–186, 2002.
[3] Dimitris Tzikas, Aristidis Likas, and Nikolaos Galatsanos, “The variational approximation for Bayesian inference,” IEEE Signal Processing Magazine, vol. 25, no. 6, pp. 131–146, 2008.
[4] Colas Schretter, Jianyong Sun, and Peter Schelkens, “Image reconstruction with smoothed mixtures of regressions,” in Proc. of IEEE International Conference on Image Processing, 2018.