Solving general elliptical mixture models through an approximate Wasserstein manifold

Shengxi Li; Zeyang Yu; Min Xiang; Danilo Mandic

arXiv:1906.03700·cs.LG·October 9, 2020


TL;DR

This paper introduces a novel stable and robust method for estimating elliptical mixture models using an approximate Wasserstein distance, outperforming traditional approaches in stability and convergence.

Contribution

It proposes an efficient optimization framework on a statistical manifold with an approximate Wasserstein distance for elliptical mixture models, improving stability and convergence.

Findings

01. Outperforms existing methods in stability and accuracy

02. Provides a unifying account of computable elliptical mixture models

03. Demonstrates superior performance through experiments

Abstract

We address the estimation problem for general finite mixture models, with a particular focus on the elliptical mixture models (EMMs). Compared to the widely adopted Kullback-Leibler divergence, we show that the Wasserstein distance provides a more desirable optimisation space. We thus provide a stable solution to the EMMs that is both robust to initialisations and reaches a superior optimum by adaptively optimising along a manifold of an approximate Wasserstein distance. To this end, we first provide a unifying account of computable and identifiable EMMs, which serves as a basis to rigorously address the underpinning optimisation problem. Due to a probability constraint, solving this problem is extremely cumbersome and unstable, especially under the Wasserstein distance. To relieve this issue, we introduce an efficient optimisation method on a statistical manifold defined under an…

Tables (5)

Table 1: $\mathcal{R}^{2}\leftrightarrow$ Computable elliptical distributions

Types	$ℛ^{2}$	$\leftrightarrow c_{m} \cdot g (t)$	Typical Multivariate Dist.
Kotz Type Ref[a]	$ℛ^{2} =^{d} 𝒢^{1 / s}$, $𝒢 \sim Ga (\frac{2 a + m - 2}{2 s}, b)$, $a > 1 - \frac{m}{2}$, $b, s > 0$	$= (\frac{Γ (m / 2) s b^{(2 a + m - 2) / (2 s)}}{Γ ((2 a + m - 2) / (2 s)) π^{m / 2}}) t^{a - 1} \exp (- b t^{s})$	Gamma: $s = 1$; Weibull: $a = s$; Generalised Gaussian: $a = 1$; Gaussian: $a = 1, s = 1, b = \frac{1}{2}$
Scale Mixture of Normals	$ℛ^{2} =^{d} 𝒢 \cdot 𝒦$, $𝒢 \sim Ga (\frac{m}{2}, \frac{1}{2})$, $𝒦$ has different dist.		
Scale Mixture of Normals: Pearson Type VII Ref[a]	$𝒦^{- 1} \sim Ga (s - \frac{m}{2}, \frac{v}{2})$, $v > 0$, $s > m / 2$	$= (\frac{{(π v)}^{- m / 2} Γ (s)}{Γ (s - m / 2)}) {(1 + t / v)}^{- s}$	$T$-dist.: $s = \frac{m + v}{2}$; Cauchy: $v = 1$, $s = \frac{m + 1}{2}$
Scale Mixture of Normals: Hyperbolic Type Ref[b]	$𝒦 \sim GIG (v, a, λ)$, $v, a > 0$, $λ \in ℝ$	$= (\frac{{(v / a)}^{λ / 2}}{{(2 π)}^{m / 2} {BeK}_{λ} (\sqrt{a v})}) \frac{{BeK}_{(λ - m / 2)} (\sqrt{a v + v t})}{{(\sqrt{a / v + t / v})}^{m / 2 - λ}}$	Inverse-Gaussian: $λ = - 1 / 2$; $K$-dist. (?): $a \to 0, λ > 0$; Laplace: $a \to 0, λ = 1, v = 2$
Scale Mixture of Normals: Other Types Ref[c]	$\sqrt{𝒦} \sim \partial Kov (\frac{𝒦}{2}) / \partial 𝒦$	$= \exp (- t) / {(1 + \exp (- t))}^{2}$	Logistic
Scale Mixture of Normals: Other Types Ref[c]	$𝒦 \sim S α S (\frac{a}{2})$, $a \in (0, 2)$	$\propto S α S (a)$	$α$-stable
Pearson Type II Ref[a]	$ℛ^{2} \sim Beta (m / 2, s)$, $s > 1$	$= (\frac{Γ (m / 2 + s)}{π^{m / 2} Γ (s)}) {(1 - t)}^{s - 1}$, $t \in [0, 1]$	
Notations:	$a, b, s, v, λ, α$ are adjustable parameters for different types of dist.; $𝒢$ and $𝒦$ are random variables related to $ℛ^{2}$; $m$ is the dimension.
	Gamma dist.: $Ga (x, y) = y^{x} t^{x - 1} \exp (- y t) / Γ (x)$; Generalised inverse Gaussian dist.: $GIG (x, y, z) = \frac{{(x / y)}^{z / 2}}{2 {BeK}_{z} (\sqrt{x y})} t^{z - 1} \exp (- \frac{x t^{2} + y}{2 t})$;
	Kolmogorov-Smirnov dist.: $Kov (x) = 1 - 2 \sum_{n = 1}^{\infty} {(- 1)}^{n + 1} \exp (- 2 n^{2} x^{2})$; Beta dist.: $Beta (x, y) = \frac{Γ (x + y)}{Γ (x) Γ (y)} t^{x - 1} {(1 - t)}^{y - 1}$;
	$S α S (a)$: the symmetric $α$-stable dist. with index $a$; ${BeK}_{x} (y)$: the Bessel function of the third kind; $Γ (x)$: the Gamma function.
References:	[a]: (?); [b]: (?); [c]: (?; ?)

Table 2: Basic operations of the manifold in Lemma 2

For the $h$ -th iteration
	$\nabla_{U} (\cdot)$	$Exp (- α \nabla_{U} (\cdot))$
${\sqrt{𝝅}}^{h}$	$\nabla_{E} ({\sqrt{𝝅}}^{h}) - {({\sqrt{𝝅}}^{h})}^{T} \nabla_{E} ({\sqrt{𝝅}}^{h}) \cdot {\sqrt{𝝅}}^{h}$	$\cos ({‖ α \nabla_{U} ({\sqrt{𝝅}}^{h}) ‖}_{2}) {\sqrt{𝝅}}^{h} - \frac{\sin ({‖ α \nabla_{U} ({\sqrt{𝝅}}^{h}) ‖}_{2})}{{‖ \nabla_{U} ({\sqrt{𝝅}}^{h}) ‖}_{2}} \nabla_{U} ({\sqrt{𝝅}}^{h})$
$𝝁_{i}^{h}$	$\nabla_{E} (𝝁_{i}^{h})$	$𝝁_{i}^{h} - α \nabla_{U} (𝝁_{i}^{h})$
$𝚺_{i}^{h}$	$\nabla_{E} (𝚺_{i}^{h}) 𝚺_{i}^{h} + 𝚺_{i}^{h} \nabla_{E} (𝚺_{i}^{h})$	$(𝕃_{𝚺_{i}^{h}} [- α \nabla_{U} (𝚺_{i}^{h})] + I) 𝚺_{i}^{h} (𝕃_{𝚺_{i}^{h}} [- α \nabla_{U} (𝚺_{i}^{h})] + I)$
For the Radon transform $𝐩 \in 𝕊^{m - 1}$ , specified in the EMM problems:
$\nabla_{E} ({\sqrt{𝝅}}^{h})$	The $i$ -th dimension of $\nabla_{E} ({\sqrt{𝝅}}^{h})$ is $2 \int_{ℝ} c_{m} ϕ (y) {\sqrt{π_{i}}}^{h} {(𝐩^{T} 𝚺_{i}^{h} 𝐩)}^{- 1 / 2} g (\frac{{(y - 𝐩^{T} 𝝁_{i})}^{2}}{𝐩^{T} 𝚺_{i}^{h} 𝐩}) 𝑑 y$
$\nabla_{E} (𝝁_{i}^{h})$	$= (- 2 \int_{ℝ} c_{m} ϕ (y) π_{i}^{h} {(𝐩^{T} 𝚺_{i}^{h} 𝐩)}^{- 3 / 2} g^{'} (\frac{{(y - 𝐩^{T} 𝝁_{i})}^{2}}{𝐩^{T} 𝚺_{i}^{h} 𝐩}) (y - 𝐩^{T} 𝝁_{i}^{h}) 𝑑 y) 𝐩$
$\nabla_{E} (𝚺_{i}^{h})$	$= (- \int_{ℝ} c_{m} ϕ (y) π_{i}^{h} {(𝐩^{T} 𝚺_{i}^{h} 𝐩)}^{- 3 / 2} (\frac{1}{2} g (\frac{{(y - 𝐩^{T} 𝝁_{i}^{h})}^{2}}{𝐩^{T} 𝚺_{i}^{h} 𝐩}) + g^{'} (\frac{{(y - 𝐩^{T} 𝝁_{i}^{h})}^{2}}{𝐩^{T} 𝚺_{i}^{h} 𝐩}) \frac{{(y - 𝐩^{T} 𝝁_{i}^{h})}^{2}}{𝐩^{T} 𝚺_{i}^{h} 𝐩}) 𝑑 y) {𝐩𝐩}^{T}$
$ϕ (y)$ is the Kantorovich potential (?). $\nabla_{E} (\cdot)$ denotes the Euclidean gradient with regard to ${\sqrt{𝝅}}^{h}$ , $𝝁_{i}^{h}$ and $𝚺_{i}^{h}$ .
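
The three update rules in Table 2 can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the Euclidean gradients are passed in as given numbers rather than computed from the integrals above, and $𝕃_{𝚺}[\cdot]$ is assumed to follow the usual Lyapunov convention (the solution $L$ of $L 𝚺 + 𝚺 L = V$), which this section does not define. Under that assumption and a symmetric Euclidean gradient $G$, $𝕃_{𝚺}[- α (G 𝚺 + 𝚺 G)] = - α G$, so the scatter update reduces to $(I - α G) 𝚺 (I - α G)$.

```python
import math


def sphere_step(sqrt_pi, grad_e, alpha):
    """Update sqrt(pi) along the unit sphere (first row of Table 2)."""
    # Riemannian gradient: remove the component of grad_e along sqrt_pi.
    inner = sum(x * g for x, g in zip(sqrt_pi, grad_e))
    grad_u = [g - inner * x for x, g in zip(sqrt_pi, grad_e)]
    norm = math.sqrt(sum(g * g for g in grad_u))
    if norm == 0.0:
        return list(sqrt_pi)  # stationary point, nothing to move
    c, s = math.cos(alpha * norm), math.sin(alpha * norm)
    return [c * x - (s / norm) * g for x, g in zip(sqrt_pi, grad_u)]


def mean_step(mu, grad_e, alpha):
    """Update a mean vector: the exponential map is a plain Euclidean step."""
    return [m - alpha * g for m, g in zip(mu, grad_e)]


def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def scatter_step(sigma, grad_e, alpha):
    """Update a scatter matrix as (I - alpha G) Sigma (I - alpha G).

    Assumes grad_e (G) is symmetric and the Lyapunov convention for L_Sigma.
    """
    n = len(sigma)
    factor = [[(1.0 if i == j else 0.0) - alpha * grad_e[i][j]
               for j in range(n)] for i in range(n)]
    return matmul(matmul(factor, sigma), factor)


# Made-up inputs, chosen only to exercise the three updates.
new_sqrt_pi = sphere_step([math.sqrt(0.5), math.sqrt(0.3), math.sqrt(0.2)],
                          [0.4, -0.1, 0.2], alpha=0.1)
new_mu = mean_step([1.0, 2.0], [0.5, -0.5], alpha=0.2)
new_sigma = scatter_step([[2.0, 0.5], [0.5, 1.0]],
                         [[0.3, -0.1], [-0.1, 0.2]], alpha=0.1)
```

The squared entries of `new_sqrt_pi` still sum to one and `new_sigma` stays symmetric positive definite whenever `I - alpha G` is invertible, which is the sense in which steps along the manifold respect the probability constraint without any projection.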

Alg. 1: Riemannian adaptively accelerated manifold optimisation
Input: $n$ observed samples $𝐲_{1}, 𝐲_{2}, \dots, 𝐲_{n}$; stepsize $\{α^{h}\}_{h = 1}^{H}$;
		hyper-parameters $\{β_{1}^{h}\}_{h = 1}^{H}$ and $β_{2}$
Initialise: $1^{st}$-order moment $\{𝐮_{i}^{0}\}_{i = 1}^{k}$; $2^{nd}$-order moment $\{𝐯_{i}^{0}\}_{i = 1}^{k}$;
	for $h = 1$ to $H$ do
		Random projection: $𝐩 \in 𝕊^{(m - 1)}$
		Update ${\sqrt{𝝅}}^{h + 1} = Exp (- α \nabla_{U} ({\sqrt{𝝅}}^{h}))$
		for $i = 1$ to $k$ do
			Update $𝝁_{i}^{h + 1} = Exp (- α \nabla_{U} (𝝁_{i}^{h}))$
			Update $𝚺_{i}^{h}$ by the Dadam:
				$𝐮_{i}^{h} = β_{1}^{h} φ_{𝚺_{i}^{h - 1} \to 𝚺_{i}^{h}} (𝐮_{i}^{h - 1}) + (1 - β_{1}^{h}) \nabla_{U} (𝚺_{i}^{h})$
				$𝐯_{i}^{h} = β_{2} 𝐯_{i}^{h - 1} + (1 - β_{2}) \nabla_{E} (𝚺_{i}^{h}) \nabla_{E} {(𝚺_{i}^{h})}^{T}$
				${adp}_{i}^{h} = \max \{𝐩^{T} 𝐯_{i}^{h} 𝐩, {adp}_{i}^{h - 1}\}$
				$𝚺_{i}^{h + 1} = Exp (- α^{h} 𝐮_{i}^{h} / \sqrt{{adp}_{i}^{h}})$
		end for
	end for
Return: $𝝅^{H}, \{𝝁_{i}^{H}\}_{i = 1}^{k}, \{𝚺_{i}^{H}\}_{i = 1}^{k}$

Table 3: Comparisons among our, WO/M and EM methods for GMMs by varying $m$ and $k$.

	$m = 2, k = 3$			$m = 8, k = 9$			$m = 16, k = 27$
	Wass	NLL	Time per Ite.	Wass	NLL	Time per Ite.	Wass	NLL	Time per Ite.
W/M+Dadam	0.01 $\pm$ 0.00	5.10 $\pm$ 0.00	4.35ms	0.03 $\pm$ 0.03	19.49 $\pm$ 0.03	9.83ms	2.39 $\pm$ 0.33	44.92 $\pm$ 1.04	32.51ms
WO/M	0.03 $\pm$ 0.00	5.12 $\pm$ 0.00	4.23ms	4.26 $\pm$ 3.61	22.27 $\pm$ 1.69	9.64ms	5e3 $\pm$ 589	Inf	29.77ms
EM	0.65 $\pm$ 1.06	5.30 $\pm$ 0.35	3.10ms	0.70 $\pm$ 0.46	19.83 $\pm$ 0.29	13.92ms	2.51 $\pm$ 1.27	48.84 $\pm$ 1.24	215.00ms

Table 4: Comparison performance of our, WO/M and IRA algorithms over the BSDS500 dataset via five metrics

	Gaussian Mixture			Logistic Mixture			Cauchy Mixture			Gamma Mixture
	Our	WO/M	IRA	Our	WO/M	IRA	Our	WO/M	IRA	Our	WO/M	IRA
Wass	35.0 $\pm$ 4.68	89.5 $\pm$ 10.0	53.6 $\pm$ 21.9	45.8 $\pm$ 12.7	154 $\pm$ 10.9	68.0 $\pm$ 24.2	179 $\pm$ 15.3	211 $\pm$ 7.77	253 $\pm$ 65.4	35.5 $\pm$ 1.82	510 $\pm$ 71.2	–
NLL	11.8 $\pm$ 0.01	12.2 $\pm$ 0.11	11.8 $\pm$ 0.05	10.6 $\pm$ 0.02	12.5 $\pm$ 0.25	10.6 $\pm$ 0.06	12.1 $\pm$ 0.01	12.3 $\pm$ 0.01	12.0 $\pm$ 0.05	13.3 $\pm$ 0.07	19.1 $\pm$ 0.86	–
PSNR (dB)	18.9 $\pm$ 0.07	18.6 $\pm$ 0.18	18.3 $\pm$ 0.66	18.8 $\pm$ 0.11	18.2 $\pm$ 0.18	17.8 $\pm$ 0.67	19.3 $\pm$ 0.05	19.3 $\pm$ 0.06	18.4 $\pm$ 0.56	21.0 $\pm$ 0.08	18.0 $\pm$ 0.58	–
SSIM	0.68 $\pm$ 0.00	0.65 $\pm$ 0.01	0.66 $\pm$ 0.03	0.67 $\pm$ 0.01	0.62 $\pm$ 0.01	0.63 $\pm$ 0.03	0.70 $\pm$ 0.00	0.70 $\pm$ 0.00	0.69 $\pm$ 0.02	0.73 $\pm$ 0.00	0.59 $\pm$ 0.03	–
FailR	0.20%	53.1%	2.98%	0.46%	75.5%	2.68%	1.16%	0.82%	17.16%	0%	2.02%	100%

Equations (82)

$\bm{\mathcal{X}}=^{d}\bm{\mu}+\mathcal{R}\mathbf{\Lambda}\bm{\mathcal{S}},$

$p(\mathbf{x})=c_{m}\,\mathrm{det}(\mathbf{\Sigma})^{-\frac{1}{2}}\,g\big(\underbrace{(\mathbf{x}-\bm{\mu})^{T}\mathbf{\Sigma}^{-1}(\mathbf{x}-\bm{\mu})}_{t}\big),\quad c_{m}=\frac{\Gamma(\frac{m}{2})}{2\pi^{\frac{m}{2}}}.$

$p(\mathbf{y})=\sum_{i=1}^{k}\pi_{i}\,c_{m}\,\mathrm{det}(\mathbf{\Sigma}_{i})^{-\frac{1}{2}}\,g\big((\mathbf{y}-\bm{\mu}_{i})^{T}\mathbf{\Sigma}_{i}^{-1}(\mathbf{y}-\bm{\mu}_{i})\big),$

$d_{U}(\bm{\mathcal{Y}}_{1},\bm{\mathcal{Y}}_{2})=\min_{\gamma(i,j)}\big(\dots+\arccos\big(\sum_{i,j}\gamma(i,j)\sqrt{\pi_{i,1}\pi_{j,2}}\big)\big),$

$d_{W}^{2}(\bm{\mathcal{Y}}_{1},\bm{\mathcal{Y}}_{2})\leq d_{U}(\bm{\mathcal{Y}}_{1},\bm{\mathcal{Y}}_{2}).$

$ds^{2}=\frac{\mathbb{E}[\mathcal{R}^{2}]}{m}\,(\mathbb{L}_{\mathbf{\Sigma}_{i}}[d\mathbf{\Sigma}])\,\mathbf{\Sigma}_{i}\,(\mathbb{L}_{\mathbf{\Sigma}_{i}}[d\mathbf{\Sigma}]).$

$\bm{\mathcal{Y}}_{\bm{\theta}+\Delta\bm{\theta}^{*}}=$

$\to\Delta\bm{\theta}^{*}=\mathrm{Exp}(-\alpha\nabla_{U}(\bm{\theta}))$

$p_{\mathcal{R}^{2}}(t)=\frac{1}{2}\cdot g(t)\cdot t^{\nicefrac{m}{2}-1}.$

$c=\Big(\int_{0}^{\infty}t^{2a+m-3}\exp(-bt^{2s})\,dt\Big)^{-1}=\frac{2sb^{\frac{2a+m-2}{2s}}}{\Gamma(\frac{2a+m-2}{2s})}.$

$p_{\mathcal{R}^{2}}(t)=\frac{sb^{\frac{2a+m-2}{2s}}}{\Gamma(\frac{2a+m-2}{2s})}\,t^{\frac{m}{2}+a-2}\exp(-bt^{s}).$

$p_{\mathcal{G}^{1/s}}(t)=\frac{sb^{\frac{2a+m-2}{2s}}}{\Gamma(\frac{2a+m-2}{2s})}\,t^{\frac{m}{2}+a-2}\exp(-bt^{s}).$

$p_{\bm{\mathcal{X}}}(\mathbf{x})=\int_{t}p_{\bm{\mathcal{X}}}(\mathbf{x}\mid t)\,p_{\mathcal{K}}(t)\,dt\propto\int_{t}t^{-\frac{m}{2}}\exp\Big(-\frac{\mathbf{x}^{T}\mathbf{\Sigma}^{-1}\mathbf{x}}{2t}\Big)p_{\mathcal{K}}(t)\,dt.$

$\bm{\mathcal{X}}=\sqrt{\mathcal{K}}\,\bm{\mathcal{N}}=\sqrt{\mathcal{K}}\cdot\sqrt{\chi_{m}^{2}}\cdot\bm{\mathcal{S}}=\sqrt{\mathcal{K}\cdot\mathcal{G}}\,\bm{\mathcal{S}},$

$c=\frac{2\Gamma(s)}{v^{\frac{m}{2}}\,\Gamma(s-\frac{m}{2})\,\Gamma(\frac{m}{2})}.$

$p_{\mathcal{R}^{2}}(t)=\frac{\Gamma(s)}{\Gamma(s-\frac{m}{2})\,\Gamma(\frac{m}{2})\,v^{\frac{m}{2}}}\Big(1+\frac{t}{v}\Big)^{-s}t^{\frac{m}{2}-1}.$

$p_{\mathcal{R}^{2}}(t)=\int_{0}^{\infty}p_{\mathcal{G}}(t\mid\tau)\,p_{\mathcal{K}}(\tau)\,d\tau$

$=\int_{0}^{\infty}\frac{(\frac{1}{2\tau})^{\frac{m}{2}}\,t^{\frac{m}{2}-1}\exp(-\frac{t}{2\tau})}{\Gamma(\frac{m}{2})}\cdot\frac{(\frac{v}{2})^{s-\frac{m}{2}}}{\Gamma(s-\frac{m}{2})}\,\tau^{\frac{m}{2}-s-1}\exp\Big(-\frac{v}{2\tau}\Big)\,d\tau$

$=\frac{\Gamma(s)\,t^{\frac{m}{2}-1}}{\Gamma(s-\frac{m}{2})\,\Gamma(\frac{m}{2})\,v^{\frac{m}{2}}\,(1+\frac{t}{v})^{s}},$

$f\big(\bm{\mathcal{Y}}_{1},\bm{\mathcal{Y}}_{2},\gamma(i,j)\big)=\sum_{i,j}\gamma(i,j)\,d_{W}^{2}(\bm{\mathcal{X}}_{i,1},\bm{\mathcal{X}}_{j,2})+\arccos\big(\sum_{i,j}\gamma(i,j)\sqrt{\pi_{i,1}\pi_{j,2}}\big).$

$d_{U}(\bm{\mathcal{Y}}_{1},\bm{\mathcal{Y}}_{3})+d_{U}(\bm{\mathcal{Y}}_{2},\bm{\mathcal{Y}}_{3})$

$=f\big(\bm{\mathcal{Y}}_{1},\bm{\mathcal{Y}}_{3},\gamma^{*}(i,h)\big)+f\big(\bm{\mathcal{Y}}_{2},\bm{\mathcal{Y}}_{3},\gamma^{*}(j,h)\big)$

$=\frac{1}{k}\big(\sum_{i,h}\gamma^{*}(i,h)\,d_{W}^{2}(\bm{\mathcal{X}}_{i,1},\bm{\mathcal{X}}_{h,3})+\sum_{h,j}\gamma^{*}(h,j)\,d_{W}^{2}(\bm{\mathcal{X}}_{h,3},\bm{\mathcal{X}}_{j,2})\big)$

$+\arccos\big(\sum_{i,h}\gamma^{*}(i,h)\sqrt{\pi_{i,1}\pi_{h,3}}\big)+\arccos\big(\sum_{h,j}\gamma^{*}(h,j)\sqrt{\pi_{h,3}\pi_{j,2}}\big)$

$\geq\frac{1}{k}\sum_{i,j}\big(\gamma^{*}(i,h)\cap\gamma^{*}(h,j)\big)\,d_{W}^{2}(\bm{\mathcal{X}}_{i,1},\bm{\mathcal{X}}_{j,2})$

$+\arccos\big(\sum_{i,j}\big(\gamma^{*}(i,h)\cap\gamma^{*}(h,j)\big)\sqrt{\pi_{i,1}\pi_{j,2}}\big)$

$=f\big(\bm{\mathcal{Y}}_{1},\bm{\mathcal{Y}}_{2},\gamma^{*}(i,h)\cap\gamma^{*}(h,j)\big)$

$\geq f\big(\bm{\mathcal{Y}}_{1},\bm{\mathcal{Y}}_{2},\gamma^{*}(i,j)\big)$

$=d_{U}(\bm{\mathcal{Y}}_{1},\bm{\mathcal{Y}}_{2}).$

$d_{W}^{2}(\bm{\mathcal{Y}}_{1},\bm{\mathcal{Y}}_{2})=\inf_{\eta(\bm{\mathcal{Y}}_{1},\bm{\mathcal{Y}}_{2})}\int_{m\times m}\eta(\bm{\mathcal{Y}}_{1},\bm{\mathcal{Y}}_{2})\,\|\mathbf{x}_{1}-\mathbf{x}_{2}\|_{2}^{2}\,d\mathbf{x}_{1}d\mathbf{x}_{2},$

$\sum_{i,j}\frac{\gamma^{*}(i,j)}{k}\,d_{W}^{2}(\bm{\mathcal{X}}_{i,1},\bm{\mathcal{X}}_{j,2})=\sum_{i,j}\frac{\gamma^{*}(i,j)}{k}\inf_{\eta(\bm{\mathcal{X}}_{i,1},\bm{\mathcal{X}}_{j,2})}\int_{m\times m}\eta(\bm{\mathcal{X}}_{i,1},\bm{\mathcal{X}}_{j,2})\,\|\mathbf{x}_{1}-\mathbf{x}_{2}\|_{2}^{2}\,d\mathbf{x}_{1}d\mathbf{x}_{2}$

$=\int_{m\times m}\sum_{i,j}\frac{\gamma^{*}(i,j)}{k}\,\eta^{*}(\bm{\mathcal{X}}_{i,1},\bm{\mathcal{X}}_{j,2})\,\|\mathbf{x}_{1}-\mathbf{x}_{2}\|_{2}^{2}\,d\mathbf{x}_{1}d\mathbf{x}_{2},$

$\arccos\big(\sum_{i,j}\gamma(i,j)\sqrt{\pi_{i,1}\pi_{j,2}}\big)=0.$

$d_{W}^{2}(\bm{\mathcal{Y}}_{1},\bm{\mathcal{Y}}_{2})\leq\sum_{i,j}\frac{\gamma^{*}(i,j)}{k}\,d_{W}^{2}(\bm{\mathcal{X}}_{i,1},\bm{\mathcal{X}}_{j,2})=d_{U}(\bm{\mathcal{Y}}_{1},\bm{\mathcal{Y}}_{2}).$

Code & Models

Repositories

ShengxiLi/wass_emm
Official

Full text

Solving General Elliptical Mixture Models through an Approximate Wasserstein Manifold

Shengxi Li¹, Zeyang Yu¹, Min Xiang¹, Danilo Mandic¹

¹Imperial College London

South Kensington Campus

London SW7 2AZ, UK

{shengxi.li17, z.yu17, m.xiang13, d.mandic}@imperial.ac.uk

We thank the Imperial Lee Family Scholarship Funding for the support.

Abstract

We address the estimation problem for general finite mixture models, with a particular focus on the elliptical mixture models (EMMs). Compared to the widely adopted Kullback–Leibler divergence, we show that the Wasserstein distance provides a more desirable optimisation space. We thus provide a stable solution to the EMMs that is both robust to initialisations and reaches a superior optimum by adaptively optimising along a manifold of an approximate Wasserstein distance. To this end, we first provide a unifying account of computable and identifiable EMMs, which serves as a basis to rigorously address the underpinning optimisation problem. Due to a probability constraint, solving this problem is extremely cumbersome and unstable, especially under the Wasserstein distance. To relieve this issue, we introduce an efficient optimisation method on a statistical manifold defined under an approximate Wasserstein distance, which allows for explicit metrics and computable operations, thus significantly stabilising and improving the EMM estimation. We further propose an adaptive method to accelerate the convergence. Experimental results demonstrate the excellent performance of the proposed EMM solver.

Introduction

This paper establishes a general solution to the finite mixture model problem, which has attracted extensive research effort for decades owing to both its simple representation and its potential for universal approximation of arbitrary distributions in $\mathbb{R}^{M}$. The finite mixture model also provides interpretable and statistical descriptions of data, which makes it a popular choice in a wide range of statistical learning paradigms, such as semi-supervised learning, capsule networks, and various image processing paradigms (e.g., de-noising, matching and registration).

Estimation of a finite mixture model boils down to a minimisation problem in which the mixture of distributions is treated as a parametric model $\rho(\bm{\theta})$, which is then optimised by minimising a certain discrepancy measure between $\rho(\bm{\theta})$ and the empirical distribution of the observed data $\rho^{*}$, namely, $\min_{\bm{\theta}}d\left(\rho(\bm{\theta}),\rho^{*}\right)$. This minimisation, although not explicitly stated, is a constrained problem because $\rho(\bm{\theta})$ must maintain the property of a probability density throughout, to ensure that $d(\cdot,\cdot)$ is tractable.

Due to this probability constraint, various advanced numerical algorithms (solvers) have been typically restricted by either the requirement of an increasingly flexible $\rho(\bm{\theta})$ or a powerful $d(\cdot,\cdot)$ . Such restrictions, for example, are one of the main rationales for using the expectation-maximisation (EM) algorithm to minimise the Kullback–Leibler (KL) divergence in Gaussian mixture model (GMM) problems (?). On the other hand, gradient-based numerical algorithms typically rest upon additional techniques that only work in particular situations (e.g., gradient reduction (?), positive definite projection (?), re-parametrisation (?) and Cholesky decomposition (?)). Besides the GMM, there exist other solutions that allow for flexible choices of $\rho(\bm{\theta})$ , which still belong to the EM-type methods (e.g., mixtures of t-distributions (?), Laplace distributions (?) and hyperbolic distributions (?)). Unfortunately, given other suitable candidates of distributions, those EM-type methods cannot ensure universal convergence (?; ?; ?), which dramatically limits the power of finite mixture models.

Another issue that has been highlighted in the literature is the sensitivity to initialisations when solving GMMs (?; ?). One of the main reasons is the use of the KL divergence, which operates on a “bin-to-bin” comparison between two density histograms. This means that mixtures which fall into a spurious local minimum cannot be corrected via the points outside. Indeed, with random initialisations for the GMM, Jin has proved that the EM algorithm, or any other first-order method which minimises the KL divergence, is highly likely to result in arbitrarily bad local minima (?). This can also be easily verified from the non-smooth optimisation space of the KL divergence, which exhibits various local optima as illustrated in Fig. 1-(b), even for estimating a simple GMM (Fig. 1-(a)). The gradients on this space are also highly concentrated, which leads to ill-posed gradient descent.

On the other hand, by virtue of the reflection of sample space (?) within the Wasserstein distance (?) (throughout this paper, the term Wasserstein distance refers to the squared Wasserstein distance), which employs a “cross-bin” comparison, many practical benefits may be achieved in learning tasks (?). This property is particularly appealing in mixture model problems, where it ideally provides a comprehensive distance measure over all possible transport plans. The optimisation space of the Wasserstein distance is shown in Fig. 1-(c) and is visibly smooth; in this example it contains only a global optimum. The gradients on this space are also well behaved, which promises superior optimisation. Most recently, Kolouri et al. (?) have adopted the Wasserstein distance for solving GMM problems. However, the aforementioned probability constraint enforces an extremely small stepsize (learning rate) during optimisation. Given the random projections of the sliced Wasserstein distance, this setup leads to extremely slow convergence or even non-convergence (especially in high dimensions, as shown in our experiments). At the end of their optimisation, the EM algorithm is still needed to stabilise the algorithm.
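
The contrast between a “bin-to-bin” and a “cross-bin” comparison can be seen on a toy example. The sketch below is not taken from the paper: the one-hot histograms, the unit bin spacing, the smoothing constant `eps` and all function names are assumptions made purely for illustration. It compares a target histogram with a nearby and a distant candidate; the KL divergence cannot tell the two apart, whereas the squared Wasserstein distance grows with the displacement.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """Bin-to-bin KL divergence; eps (an assumption) avoids log(0)."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0.0)


def wasserstein2_squared(p, q):
    """Squared 2-Wasserstein distance between histograms on bins 0, 1, 2, ...

    In one dimension the monotone coupling is optimal for the squared cost,
    so mass is matched greedily from left to right.
    """
    i = j = 0
    p_left, q_left = p[0], q[0]
    cost = 0.0
    while i < len(p) and j < len(q):
        moved = min(p_left, q_left)
        cost += moved * (i - j) ** 2
        p_left -= moved
        q_left -= moved
        if p_left <= 1e-15:  # bin i is exhausted, move to the next one
            i += 1
            p_left = p[i] if i < len(p) else 0.0
        if q_left <= 1e-15:  # bin j is exhausted, move to the next one
            j += 1
            q_left = q[j] if j < len(q) else 0.0
    return cost


def one_hot(index, size=10):
    """Histogram with all of its mass in a single bin."""
    return [1.0 if b == index else 0.0 for b in range(size)]


target = one_hot(2)
near, far = one_hot(3), one_hot(8)
kl_near, kl_far = kl_divergence(target, near), kl_divergence(target, far)
w_near, w_far = wasserstein2_squared(target, near), wasserstein2_squared(target, far)
```

Here `kl_near` and `kl_far` coincide because the supports never overlap, so the KL divergence offers no signal about which candidate is closer, while `w_far` is larger than `w_near` and therefore points the optimiser towards the target.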

Motivations and Contributions: Despite extensive benefits, optimising the GMM under the Wasserstein distance is still not feasible due to the probability constraint. To address this from a different perspective, we resort to the statistical manifold. We should further point out that our work is different from information geometry (?), as directly establishing a manifold in the whole density space of mixture models is absolutely intractable and cumbersome in optimisation (?). Another problem within the Wasserstein space is that the geodesic between two mixture models may not lie in mixture models of the same type (?), which leads to non-convergence.

We propose to resolve this problem by introducing an approximation to the Wasserstein distance, followed by establishing a statistical manifold via the so induced distance, which exhibits the desirable property of being complete within mixture models. The subsequent optimisation along this manifold intrinsically satisfies the probability constraint and ensures that the solution resides in the same mixture models. More importantly, minimising the induced distance is shown to be truly reducing the discrepancy between two mixture models, unlike most existing solutions which are based on the minimisation of the Euclidean distance from the optimal parameters. This ensures fast and stable convergence in optimisation. By realising that the existing Riemannian adaptive algorithms only make sense in updating vector parameters, we further develop a novel accelerated stochastic gradient descent method for updating the positive definite matrices.

In this way, our proposed framework makes it possible to incorporate a broad family of distributions, $\rho(\bm{\theta})$ , including an important class of multivariate analysis techniques called elliptical distributions (?) and to further investigate the mixture family termed the elliptical mixture model (EMM). We therefore provide computable and identifiable EMMs in a unified way, which demonstrates that EMMs are quite general and flexible and include the GMMs as special cases (?).

Overall, this paper proposes a complete and efficient framework for solving general EMM problems, by establishing a statistical manifold under an approximate Wasserstein distance which promotes stability and efficiency, together with an adaptive stochastic gradient algorithm to further accelerate the optimisation (the code of this paper is available at https://github.com/ShengxiLi/wass_emm). Compared to the existing literature on mixture problems, the proposed solution achieves consistently superior performance not only in the GMM problems but also for general EMM problems. Our contributions can be summarised as follows:

• A unified framework for dealing with computable and identifiable EMMs, which introduces a rich choice of candidates for flexible finite mixture models.

• Establishment of the statistical manifold through the proposed approximate Wasserstein distance, which provides explicit and complete operations within the manifold.

• An adaptive accelerated Riemannian gradient descent algorithm on the established manifold, to improve the optimisation and accelerate convergence.

Computable and Identifiable EMMs

Elliptical distributions include a wide range of standard distributions, and it therefore comes as no surprise that a unified summary of computable candidates as components in the EMMs is a prerequisite to problem definition and subsequent solutions. A classical summary can be found in Chapter 3 in (?); however, despite progress this framework is not general enough as various elliptical distributions are still missing, and more importantly, it involves complicated representations for each type of elliptical distributions. The existing literature also employs different notations and formulations of particular distributions, which may lead to confusion. To this end, we provide a simple and unified framework for summarising the existing computable elliptical distributions via the stochastic representation, which can then be used to constitute flexible and identifiable EMMs.

Preliminaries on Elliptical Distributions

A random variable, $\bm{\mathcal{X}}\in\mathbb{R}^{m}$ , is said to exhibit an elliptical distribution if and only if it admits the following stochastic representation (?),

$\bm{\mathcal{X}}=^{d}\bm{\mu}+\mathcal{R}\mathbf{\Lambda}\bm{\mathcal{S}},$ (1)

where $\mathcal{R}\in\mathbb{R}^{+}$ is a non-negative real scalar random variable which models the tail properties of the elliptical distribution; $\bm{\mathcal{S}}\in\mathbb{S}^{(m^{\prime}-1)}$ is a random vector which is uniformly distributed on the unit spherical surface $\mathbb{S}^{m^{\prime}-1}:=\{\mathbf{x}\in\mathbb{R}^{m^{\prime}}:\mathbf{x}^{T}\mathbf{x}=1\}$, with the constant pdf $\Gamma(\nicefrac{m^{\prime}}{2})/(2\pi^{\nicefrac{m^{\prime}}{2}})$; $\bm{\mu}\in\mathbb{R}^{m}$ is a mean (location) vector, while $\mathbf{\Lambda}\in\mathbb{R}^{m\times m^{\prime}}$ is a matrix that transforms $\bm{\mathcal{S}}$ from a sphere to an ellipse, and “$=^{d}$” designates “the same distribution”. For a comprehensive review of elliptical distributions, we refer to (?).
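
The stochastic representation translates directly into a sampling recipe, sketched below in plain Python. The function names and the example values of $\bm{\mu}$ and $\mathbf{\Lambda}$ are hypothetical. The sketch relies on two standard facts: a normalised standard Gaussian vector is uniformly distributed on the unit sphere, and choosing $\mathcal{R}^{2}$ to be chi-squared with $m$ degrees of freedom yields a Gaussian sample.

```python
import math
import random


def sample_elliptical(mu, lam, sample_r2, rng):
    """Draw one sample X = mu + R * Lambda * S."""
    m_prime = len(lam[0])
    # S: a normalised standard Gaussian vector is uniform on the unit sphere.
    z = [rng.gauss(0.0, 1.0) for _ in range(m_prime)]
    norm = math.sqrt(sum(v * v for v in z))
    s = [v / norm for v in z]
    # R: the radial part, drawn through its square R^2.
    r = math.sqrt(sample_r2(rng))
    return [mu_i + r * sum(l * s_j for l, s_j in zip(row, s))
            for mu_i, row in zip(mu, lam)]


def chi_squared_r2(m):
    """Sampler for R^2 ~ chi-squared with m degrees of freedom (Gaussian case)."""
    return lambda rng: rng.gammavariate(m / 2.0, 2.0)  # shape m/2, scale 2


rng = random.Random(0)
mu = [1.0, -1.0]                # hypothetical location vector
lam = [[1.0, 0.0], [0.5, 2.0]]  # hypothetical Lambda, so Sigma = Lambda Lambda^T
samples = [sample_elliptical(mu, lam, chi_squared_r2(2), rng) for _ in range(20000)]
sample_mean = [sum(x[d] for x in samples) / len(samples) for d in range(2)]
```

Swapping `chi_squared_r2` for another sampler of $\mathcal{R}^{2}$ changes the tail behaviour of the samples while `mu` and `lam` keep controlling location and shape.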

When $m^{\prime}=m$ , that is, for a non-singular scatter matrix $\mathbf{\Sigma}=\mathbf{\Lambda}\mathbf{\Lambda}^{T}$ , the pdf for elliptical distributions does exist and has the following form

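$p(\mathbf{x})=c_{m}\,\mathrm{det}(\mathbf{\Sigma})^{-\frac{1}{2}}\,g\big(\underbrace{(\mathbf{x}-\bm{\mu})^{T}\mathbf{\Sigma}^{-1}(\mathbf{x}-\bm{\mu})}_{t}\big),\quad c_{m}=\frac{\Gamma(\frac{m}{2})}{2\pi^{\frac{m}{2}}}.$ (2)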

In (2), the term $c_{m}$ serves as a normalisation term and relates solely to $m$. We denote the Mahalanobis distance $(\mathbf{x}-\bm{\mu})^{T}\mathbf{\Sigma}^{-1}(\mathbf{x}-\bm{\mu})$ by $t$. The density generator, $g(t)$, can be explicitly expressed as $t^{-\nicefrac{(m-1)}{2}}p_{\mathcal{R}}(\sqrt{t})$, where $t>0$ and $p_{\mathcal{R}}(\cdot)$ denotes the pdf of $\mathcal{R}$. Thus, $\mathcal{R}$, or equivalently $\mathcal{R}^{2}$, fully characterises $g(\cdot)$, i.e., the type of elliptical distribution (the term $\mathcal{R}^{2}$ is frequently used in practice because $\mathcal{R}^{2}=^{d}(\mathbf{x}-\bm{\mu})^{T}\mathbf{\Sigma}^{-1}(\mathbf{x}-\bm{\mu})$). For example, when $\mathcal{R}^{2}=^{d}{\chi_{m}^{2}}$ ($\chi_{m}^{2}$ denotes the chi-squared distribution with $m$ degrees of freedom), then in (2), $g(t)\propto\mathrm{exp}(-\nicefrac{t}{2})$, which yields the multivariate Gaussian distribution. Therefore, the elliptical distribution can be fully characterised by $\bm{\mu}$, $\mathbf{\Sigma}$ and $\mathcal{R}$. For simplicity, the elliptical distribution in (2) will be denoted by $\bm{\mathcal{X}}\sim\mathcal{E}(\mathbf{x};\bm{\mu},\mathbf{\Sigma},\mathcal{R})$, where $\mathcal{E}(\mathbf{x};\bm{\mu},\mathbf{\Sigma},\mathcal{R})=c_{m}\mathrm{det}(\mathbf{\Sigma})^{-1/2}g(t)$ as in (2).
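
The Gaussian example can be checked by hand. The short derivation below is a sketch under two assumptions that are not spelled out in this passage: $\mathcal{R}$ follows the standard chi density when $\mathcal{R}^{2}=^{d}\chi_{m}^{2}$, and $c_{m}=\Gamma(\nicefrac{m}{2})/(2\pi^{\nicefrac{m}{2}})$, the value implied by the Gaussian entry of the summary table of computable elliptical distributions.

```latex
\begin{align*}
p_{\mathcal{R}}(r) &= \frac{2^{1-m/2}}{\Gamma(m/2)}\, r^{m-1} \exp(-r^{2}/2), \qquad r > 0, \\
g(t) &= t^{-(m-1)/2}\, p_{\mathcal{R}}(\sqrt{t})
      = \frac{2^{1-m/2}}{\Gamma(m/2)} \exp(-t/2), \\
c_{m}\, g(t) &= \frac{\Gamma(m/2)}{2\pi^{m/2}} \cdot \frac{2^{1-m/2}}{\Gamma(m/2)} \exp(-t/2)
      = (2\pi)^{-m/2} \exp(-t/2).
\end{align*}
```

Multiplying by $\mathrm{det}(\mathbf{\Sigma})^{-1/2}$ then recovers the familiar multivariate Gaussian density, which confirms that the factor $t^{-\nicefrac{(m-1)}{2}}$ exactly cancels the $r^{m-1}$ term of the radial density.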

Computable and Identifiable EMMs

Since $\mathcal{R}^{2}$ determines the type of elliptical distribution, we provide a unified summary of elliptical distributions in Table 1; this is achieved through the stochastic representation in (1). This makes it possible to avoid complicated formulations, and instead to classify different categories simply through several typical distributions of $\mathcal{R}^{2}$, which also allows for simple and intuitive sample generation for elliptical distributions. The proof of the expressions in this table is provided in the Appendix. This also clarifies the commonalities between the members of the elliptical family of distributions. More importantly, an EMM constructed from the candidates in Table 1 can be easily proved to be identifiable based on Theorem 2 in (?). It is thus convenient and safe to establish a well-defined EMM from the candidates summarised in Table 1.
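
To illustrate how the table supports sample generation, the sketch below draws $\mathcal{R}^{2}$ for the Kotz and Pearson Type VII rows using only the Python standard library. The function names and parameter values are hypothetical. Note that the table's $Ga(x, y)$ uses a rate parameter $y$, whereas `random.gammavariate` expects a scale, hence the reciprocals.

```python
import random


def kotz_r2(m, a, b, s, rng):
    """R^2 = G^(1/s) with G ~ Ga((2a + m - 2) / (2s), b), b being a rate."""
    shape = (2.0 * a + m - 2.0) / (2.0 * s)
    g = rng.gammavariate(shape, 1.0 / b)  # gammavariate takes scale = 1 / rate
    return g ** (1.0 / s)


def pearson_vii_r2(m, s, v, rng):
    """R^2 = G * K with G ~ Ga(m/2, 1/2) and 1/K ~ Ga(s - m/2, v/2) (rates)."""
    g = rng.gammavariate(m / 2.0, 2.0)
    k_inv = rng.gammavariate(s - m / 2.0, 2.0 / v)
    return g / k_inv


rng = random.Random(1)
n = 50000
# Gaussian special case of the Kotz type: a = 1, s = 1, b = 1/2, here with m = 3.
gaussian_mean = sum(kotz_r2(3, 1.0, 0.5, 1.0, rng) for _ in range(n)) / n
# t-distribution special case of Pearson VII: s = (m + v) / 2, here m = 2, v = 10.
student_mean = sum(pearson_vii_r2(2, (2 + 10) / 2.0, 10.0, rng) for _ in range(n)) / n
```

As a sanity check, the Gaussian case should give an average of $\mathcal{R}^{2}$ close to $m$ (chi-squared mean), and the t case close to $m v / (v - 2)$ for $v > 2$; either sampler can be passed to a routine that implements the stochastic representation in (1).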

Bibliography (45)

The reference list from the paper itself.

1. [Absil, Mahony, and Sepulchre 2009] Absil, P.-A.; Mahony, R.; and Sepulchre, R. 2009. Optimization algorithms on matrix manifolds. Princeton University Press.
2. [Amari 1998] Amari, S.-I. 1998. Natural gradient works efficiently in learning. Neural Computation 10(2):251–276.
3. [Andrews and Mallows 1974] Andrews, D. F., and Mallows, C. L. 1974. Scale mixtures of normal distributions. Journal of the Royal Statistical Society. Series B (Methodological) 99–102.
4. [Arbelaez et al. 2011] Arbelaez, P.; Maire, M.; Fowlkes, C.; and Malik, J. 2011. Contour detection and hierarchical image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 33(5):898–916.
5. [Arjovsky, Chintala, and Bottou 2017] Arjovsky, M.; Chintala, S.; and Bottou, L. 2017. Wasserstein GAN. arXiv preprint arXiv:1701.07875.
6. [Barndorff-Nielsen, Kent, and Sørensen 1982] Barndorff-Nielsen, O.; Kent, J.; and Sørensen, M. 1982. Normal variance-mean mixtures and z distributions. International Statistical Review/Revue Internationale de Statistique 145–159.
7. [Becigneul and Ganea 2019] Becigneul, G., and Ganea, O.-E. 2019. Riemannian adaptive optimization methods. In Proceedings of the International Conference on Learning Representations.
8. [Browne and McNicholas 2015] Browne, R. P., and McNicholas, P. D. 2015. A mixture of generalized hyperbolic distributions. Canadian Journal of Statistics 43(2):176–198.