Analysis of a Generalized Expectation-Maximization Algorithm for Gaussian Mixture Models: A Control Systems Perspective

Sarthak Chatterjee; Orlando Romero; Sérgio Pequito

arXiv:1903.00979·math.OC·May 19, 2021


TL;DR

This paper analyzes a generalized EM algorithm for Gaussian mixture models from a control systems perspective, revealing its convergence properties and design advantages using robust control theory tools.

Contribution

It introduces a control-theoretic analysis of a generalized EM algorithm, providing new insights into its convergence and design for Gaussian mixture models.

Findings

1. GEM can be modeled as an LTI system with feedback nonlinearity.

2. Convergence properties are analyzed using robust control theory.

3. The approach offers a pedagogical example demonstrating advantages.

Abstract

The Expectation-Maximization (EM) algorithm is one of the most popular methods used to solve the problem of parametric distribution-based clustering in unsupervised learning. In this paper, we propose to analyze a generalized EM (GEM) algorithm in the context of Gaussian mixture models, where the maximization step in the EM is replaced by an increasing step. We show that this GEM algorithm can be understood as a linear time-invariant (LTI) system with a feedback nonlinearity. Therefore, we explore some of its convergence properties by leveraging tools from robust control theory. Lastly, we explain how the proposed GEM can be designed, and present a pedagogical example to understand the advantages of the proposed approach.

Equations

$x\in\mathcal{X}:=\{x\in\mathbb{R}^{n}:p_{\theta}(x,y)>0\}$

$\mathcal{L}(\theta):=p_{\theta}(y)=\begin{cases}\int_{\mathcal{X}}p_{\theta}(x,y)\,dx, & \text{if }x\text{ is continuous},\\ \sum_{x\in\mathcal{X}}p_{\theta}(x,y), & \text{if }x\text{ is discrete}.\end{cases}$

$Q(\theta,\theta^{\prime})=\int_{\mathcal{X}}p_{\theta^{\prime}}(x\mid y)\log p_{\theta}(x,y)\,dx,$

$\theta^{(k+1)}=\theta^{(k)}+\eta\left.\frac{\partial Q(\theta,\theta^{(k)})}{\partial\theta}\right|_{\theta=\theta^{(k)}},$

$p_{\theta_{i}}(y)=\frac{\alpha_{i}}{\sqrt{\det(2\pi\Sigma_{i})}}e^{-\frac{1}{2}(y-\mu_{i})^{\mathsf{T}}\Sigma_{i}^{-1}(y-\mu_{i})},$

$\theta=\left[\alpha^{\mathsf{T}},\mu^{\mathsf{T}},\operatorname{vec}[\Sigma]^{\mathsf{T}}\right]^{\mathsf{T}},$

$\theta^{(k+1)}=\theta^{(k)}+u^{(k)},$

$\alpha_{j}^{(k+1)}=\frac{1}{N}\sum_{t=1}^{N}h_{j}^{(k)}(t),$

$\mu_{j}^{(k+1)}=\frac{1}{\sum_{t=1}^{N}h_{j}^{(k)}(t)}\sum_{t=1}^{N}h_{j}^{(k)}(t)\,x^{(t)},$

$\Sigma_{j}^{(k+1)}=\frac{1}{\sum_{t=1}^{N}h_{j}^{(k)}(t)}\sum_{t=1}^{N}h_{j}^{(k)}(t)\,z_{j}^{(t),(k+1)}\left(z_{j}^{(t),(k+1)}\right)^{\mathsf{T}},$

$z_{j}^{(t),(k+1)}=x^{(t)}-\mu_{j}^{(k+1)},$

$h_{j}^{(k)}(t)=\frac{\alpha_{j}^{(k)}\,p(x^{(t)}\mid\mu_{j}^{(k)},\Sigma_{j}^{(k)})}{\sum_{i=1}^{K}\alpha_{i}^{(k)}\,p(x^{(t)}\mid\mu_{i}^{(k)},\Sigma_{i}^{(k)})}.$

$\Sigma_{j}^{(k+1)}=\frac{1}{\sum_{t=1}^{N}h_{j}^{(k)}(t)}\sum_{t=1}^{N}h_{j}^{(k)}(t)\,z_{j}^{(t),(k)}\left(z_{j}^{(t),(k)}\right)^{\mathsf{T}},$

$\alpha^{(k+1)}-\alpha^{(k)}=P_{\alpha^{(k)}}\left.\frac{\partial\mathcal{L}}{\partial\alpha}\right|_{\alpha=\alpha^{(k)}},$

$\mu_{j}^{(k+1)}-\mu_{j}^{(k)}=P_{\mu_{j}^{(k)}}\left.\frac{\partial\mathcal{L}}{\partial\mu_{j}}\right|_{\mu_{j}=\mu_{j}^{(k)}},$

$\operatorname{vec}[\Sigma_{j}^{(k+1)}]-\operatorname{vec}[\Sigma_{j}^{(k)}]=P_{\Sigma_{j}^{(k)}}\left.\frac{\partial\mathcal{L}}{\partial\operatorname{vec}[\Sigma_{j}]}\right|_{\Sigma_{j}=\Sigma_{j}^{(k)}},$

$P_{\alpha^{(k)}}$

$P_{\mu_{j}^{(k)}}$

$P_{\Sigma_{j}^{(k)}}$

$\theta^{(k+1)}=\theta^{(k)}+P(\theta^{(k)})\nabla\mathcal{L}(\theta^{(k)}),$

$P(\theta)=\operatorname{diag}\left[P_{\alpha},P_{\mu_{1}},\dots,P_{\mu_{K}},P_{\Sigma_{1}},\dots,P_{\Sigma_{K}}\right].$

$\theta^{(k+1)}\in\Theta=\left\{\theta:\sum_{j=1}^{K}\alpha_{j}=1,\ \Sigma_{j}=\Sigma_{j}^{\mathsf{T}}\succ 0\right\}.$

$\Theta_{s}=\left\{\theta^{\prime}:\sum_{j=1}^{K}\alpha_{j}=0\right\},$

$\theta^{(k+1)}=\theta^{(k)}+EE^{\mathsf{T}}P(\theta^{(k)})\nabla\mathcal{L}(\theta^{(k)})$

$\limsup_{k\to\infty}\frac{\|\theta^{(k+1)}-\theta^{\star}\|}{\|\theta^{(k)}-\theta^{\star}\|^{\beta}}=\rho<\infty,$

$\begin{bmatrix}\theta-\theta^{\star}\\ \nabla f(\theta)-\nabla f(\theta^{\star})\end{bmatrix}^{\mathsf{T}}\begin{bmatrix}-2\mu_{f}LI & (L+\mu_{f})I\\ (L+\mu_{f})I & -2I\end{bmatrix}\begin{bmatrix}\theta-\theta^{\star}\\ \nabla f(\theta)-\nabla f(\theta^{\star})\end{bmatrix}\geq 0$

$\xi[k+1]=A\xi[k]+Bu[k],\quad\theta[k]=C\xi[k]+Du[k],\quad u[k]=\phi(\theta[k]),$

$\begin{bmatrix}A^{\mathsf{T}}\\ B^{\mathsf{T}}\end{bmatrix}R\begin{bmatrix}A^{\mathsf{T}}\\ B^{\mathsf{T}}\end{bmatrix}^{\mathsf{T}}-\begin{bmatrix}\rho^{2}R & 0\\ 0 & 0\end{bmatrix}+\lambda\begin{bmatrix}C & D\\ 0 & I\end{bmatrix}^{\mathsf{T}}\begin{bmatrix}-2\mu_{f}LI & (L+\mu_{f})I\\ (L+\mu_{f})I & -2I\end{bmatrix}\begin{bmatrix}C & D\\ 0 & I\end{bmatrix}\preceq 0$

$\rho\leq\max\{|1-\mu_{f}|,|1-L|\}.$


Taxonomy

Topics: Gaussian Processes and Bayesian Inference · Bayesian Methods and Mixture Models · Target Tracking and Data Fusion in Sensor Networks

Full text

Analysis of a Generalized Expectation-Maximization Algorithm for Gaussian Mixture Models: A Control Systems Perspective

Sarthak Chatterjee$^{a}$, Orlando Romero$^{b}$, and Sérgio Pequito$^{b}$

CONTACT Sarthak Chatterjee. Email: [email protected]

$^{a}$ Department of Electrical, Computer, and Systems Engineering, Rensselaer Polytechnic Institute, Troy NY, 12180, USA; $^{b}$ Department of Industrial and Systems Engineering, Rensselaer Polytechnic Institute, Troy NY, 12180, USA


keywords:

Statistical data analysis, Linear multivariable systems, Output regulation, Robust control applications.

1 Introduction

A fundamental problem in unsupervised learning is the problem of clustering, where the task in question is to group certain objects of interest into subgroups called clusters, such that all objects in a particular cluster share features (in some predefined sense) with each other, but not with objects in other clusters (Tan et al., 2005; Bishop, 2006).

The Expectation-Maximization (EM) algorithm (Dempster et al., 1977) is one of the most commonly used methods in parametric distribution-based clustering analysis (Nowak, 2003). Notably, Gaussian mixture models (GMMs) (i.e., a finite convex combination of multivariate Gaussian distributions) have found several applications in real-world problems (Tan et al., 2005; Bishop, 2006). In this setting, clustering consists of estimating the parameters in a GMM that maximize its likelihood function (iteratively maximized through the EM algorithm), followed by assigning to each data point the ‘cluster’ corresponding to its most likely multivariate Gaussian distribution in the GMM.

The convergence of the EM algorithm is well-studied in the literature (Wu, 1983), particularly in the context of determining the parameters of GMMs (Xu and Jordan, 1996). Nonetheless, it is worth analyzing the EM algorithm as a dynamical system, as doing so may yield insights that enable us to design more efficient variations of the EM algorithm. Therefore, in Romero et al. (2019), the authors proposed to change the perspectives on local optimizers and convergence of the EM algorithm by assessing, respectively, the equilibria and asymptotic stability (in the sense of Lyapunov) of a nonlinear dynamical system that represents the standard EM algorithm, through explicit use of discrete-time Lyapunov stability theory.

In this paper, we build upon the recent work in Romero et al. (2019) and propose to analyze a generalized EM (GEM) algorithm (Dempster et al., 1977; Neal and Hinton, 1998) in the context of Gaussian mixture models, where the maximization step in the EM is replaced by an increasing step. GEM algorithms have also been used in applications such as computer vision (Fessler and Hero, 1995) and noise estimation in communication channels (Kristjansson et al., 2001), and, in general, the study of the EM algorithm and its myriad variants constitutes an active area of research (Moon, 1996; Roche, 2011). The main contributions of this work are as follows. First, we show that this GEM algorithm can be understood as a linear time-invariant (LTI) system with a feedback nonlinearity. Secondly, we explore some of its convergence properties by leveraging tools from robust control theory. Lastly, we explain how the proposed GEM can be designed, and present a pedagogical example to understand the advantages of the proposed approach.

2 Problem Statement

Let $\theta\in\Theta\subseteq\mathbb{R}^{p}$ be some vector of unknown (but deterministic) parameters characterizing a distribution of interest, which we seek to infer from a collected dataset $y\in\mathbb{R}^{d}$ (from now on assumed fixed) and a statistical model composed by a family of joint probability density or mass functions (possibly mixed) $p_{\theta}(x,y)$ indexed by $\theta\in\Theta$ , where

$x\in\mathcal{X}:=\{x\in\mathbb{R}^{n}:p_{\theta}(x,y)>0\}$

is some latent (hidden) random vector.

The EM algorithm seeks to find a local maximizer of the incomplete likelihood function $\mathcal{L}:\Theta\to\mathbb{R}$ given by

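$\mathcal{L}(\theta):=p_{\theta}(y)=\begin{cases}\int_{\mathcal{X}}p_{\theta}(x,y)\,dx, & \text{if }x\text{ is continuous},\\ \sum_{x\in\mathcal{X}}p_{\theta}(x,y), & \text{if }x\text{ is discrete}.\end{cases}$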

The mapping $\theta\mapsto p_{\theta}(x,y)$ is, naturally, referred to as the complete likelihood function. To optimize $\mathcal{L}(\theta)$ , the EM algorithm alternates at each iteration $k$ between two steps. First, in the expectation step (E-step), we compute $Q(\theta,\theta^{(k)})$ , defined through

$Q(\theta,\theta^{\prime})=\int_{\mathcal{X}}p_{\theta^{\prime}}(x\mid y)\log p_{\theta}(x,y)\,dx,$

so that $Q(\cdot,\theta^{(k)})$ denotes the expected value of the complete log-likelihood function with respect to $\theta=\theta^{(k)}$ . Second, in the maximization step (M-step), we maximize $Q(\cdot,\theta^{(k)})$ and update the current estimate as that maximizer.

Before formally stating the EM algorithm, let us make some mild simplifying assumptions that will avoid pathological behavior on the $Q$ -function, $Q:\Theta\times\Theta\to\mathbb{R}$ .

Assumption 1.

$\mathcal{X}$ does not depend on $\theta\in\Theta$ and has positive Lebesgue measure.

Assumption 2.

$\mathcal{L}$ is twice continuously differentiable in $\Theta$ .

Notice that, from Assumption 1, the conditional distribution $p_{\theta^{\prime}}(x|y)=p_{\theta^{\prime}}(x,y)/p_{\theta^{\prime}}(y)$ is well defined in $\mathcal{X}$ , since $p_{\theta}(y)>0$ for every $\theta\in\Theta$ . Finally, we make the following simplifying assumption, which makes the M-step well defined.

Assumption 3.

$Q(\cdot,\theta^{\prime})$ has a unique global maximizer in $\Theta$ .

With all these ingredients and assumptions, we summarize the EM algorithm in Algorithm 1.

However, it is to be kept in mind that when we implement the EM algorithm, for most parametric distributions, we do not obtain a closed-form expression for the M-step. As a consequence, to determine a solution (i.e., an approximation) in the M-step, we need to rely on numerical optimization schemes. For instance, we can consider first-order optimization algorithms (e.g., gradient ascent), i.e.,

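$\theta^{(k+1)}=\theta^{(k)}+\eta\left.\frac{\partial Q(\theta,\theta^{(k)})}{\partial\theta}\right|_{\theta=\theta^{(k)}},$ (4)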

for some $\eta>0$ . Notice that this could constitute a problem by itself since first-order algorithms are known to have slow convergence rates that get aggravated by the increase in the dimension of the search space. Furthermore, any variant of Algorithm 1 that does not explicitly maximize $Q(\cdot,\theta^{(k)})$ at the M-step, but instead is such that $Q(\theta^{(k+1)},\theta^{(k)})>Q(\theta^{(k)},\theta^{(k)})$ is referred to as a generalized EM (GEM) algorithm.
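The GEM condition $Q(\theta^{(k+1)},\theta^{(k)})>Q(\theta^{(k)},\theta^{(k)})$ can be checked numerically. The following is a minimal sketch, assuming a one-dimensional two-component mixture in which only the means are updated by one gradient-ascent step on $Q$; the data, the step size `eta`, and all function names are hypothetical and chosen only for illustration.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a scalar Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def responsibilities(xs, alphas, mus, variances):
    """h[t][j]: posterior probability that sample t came from component j."""
    h = []
    for x in xs:
        w = [a * normal_pdf(x, m, v) for a, m, v in zip(alphas, mus, variances)]
        total = sum(w)
        h.append([wj / total for wj in w])
    return h


def q_function(xs, h, alphas, mus, variances):
    """Q(theta, theta_k), where h was computed at theta_k."""
    return sum(
        h[t][j] * (math.log(alphas[j]) + math.log(normal_pdf(x, mus[j], variances[j])))
        for t, x in enumerate(xs)
        for j in range(len(alphas))
    )


def gem_mean_step(xs, alphas, mus, variances, eta):
    """One gradient-ascent step on Q with respect to the means only."""
    h = responsibilities(xs, alphas, mus, variances)
    new_mus = []
    for j, mu in enumerate(mus):
        # dQ/dmu_j evaluated at the current estimate
        grad = sum(h[t][j] * (x - mu) for t, x in enumerate(xs)) / variances[j]
        new_mus.append(mu + eta * grad)
    return new_mus, h


# Hypothetical data and initial guess
xs = [-2.1, -1.9, -2.4, 1.8, 2.2, 2.0]
alphas, mus, variances = [0.5, 0.5], [-0.5, 0.5], [1.0, 1.0]
new_mus, h = gem_mean_step(xs, alphas, mus, variances, eta=0.05)
q_old = q_function(xs, h, alphas, mus, variances)
q_new = q_function(xs, h, alphas, new_mus, variances)
print("Q increased:", q_new > q_old)
```

The increase is only guaranteed for a sufficiently small step size; in this sketch $Q$ is a concave quadratic in each mean, so too large an `eta` would overshoot and could decrease $Q$.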

As previously mentioned, a particularly important class of models are the Gaussian mixture models (GMMs). In these models, each component of the mixture is given by

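$p_{\theta_{i}}(y)=\frac{\alpha_{i}}{\sqrt{\det(2\pi\Sigma_{i})}}e^{-\frac{1}{2}(y-\mu_{i})^{\mathsf{T}}\Sigma_{i}^{-1}(y-\mu_{i})},$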

with $i=1,2,\ldots,K$ , $y,\mu_{i}\in\mathbb{R}^{d}$ , $\Sigma_{i}\in\mathbb{R}^{d\times d}$ positive definite, and $\alpha_{i}\in[0,1]$ such that $\sum_{i=1}^{K}\alpha_{i}=1$ . The vector of unknown parameters $\theta$ lumps together the scalar parameters within $\alpha_{i},\mu_{i},\Sigma_{i}$ for $i\in\{1,\ldots,K\}$ , as follows:

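$\theta=\left[\alpha^{\mathsf{T}},\mu^{\mathsf{T}},\operatorname{vec}[\Sigma]^{\mathsf{T}}\right]^{\mathsf{T}},$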

where $\alpha^{\mathsf{T}}=\left[\alpha_{1},\ldots,\alpha_{K}\right]^{\mathsf{T}}$ , $\mu^{\mathsf{T}}=\left[\mu_{1},\ldots,\mu_{K}\right]^{\mathsf{T}}$ , and $\operatorname{vec}\left[\Sigma\right]=\left[\operatorname{vec}\left[\Sigma_{1}\right]^{\mathsf{T}},\ldots,\operatorname{vec}\left[\Sigma_{K}\right]^{\mathsf{T}}\right]^{\mathsf{T}}$ , with $\operatorname{vec}(M)$ denoting the vector obtained by stacking the column vectors of $M$ .
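The stacking of the mixture parameters into $\theta$ can be sketched as follows. This is an illustrative helper, not code from the paper; the names `vec`, `pack_theta`, and `unpack_theta` are hypothetical, and matrices are represented as plain lists of rows.

```python
def vec(matrix):
    """Stack the columns of a matrix (given as a list of rows) into a flat list."""
    rows, cols = len(matrix), len(matrix[0])
    return [matrix[r][c] for c in range(cols) for r in range(rows)]


def pack_theta(alphas, mus, sigmas):
    """Build theta = [alpha^T, mu^T, vec[Sigma]^T]^T as one flat list."""
    theta = list(alphas)
    for mu in mus:
        theta.extend(mu)
    for sigma in sigmas:
        theta.extend(vec(sigma))
    return theta


def unpack_theta(theta, K, d):
    """Inverse of pack_theta for K components in dimension d."""
    alphas = theta[:K]
    offset = K
    mus = [theta[offset + j * d: offset + (j + 1) * d] for j in range(K)]
    offset += K * d
    sigmas = []
    for j in range(K):
        flat = theta[offset + j * d * d: offset + (j + 1) * d * d]
        # flat is column-major, so entry (r, c) sits at index c * d + r
        sigmas.append([[flat[c * d + r] for c in range(d)] for r in range(d)])
    return alphas, mus, sigmas
```

With $K$ components in dimension $d$, the packed vector has $K+dK+d^{2}K$ entries.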

In this setting, an alternative is to replace the M-step by (4), which yields a GEM that is able to recover similar (asymptotic) convergence rates available in the literature (Balakrishnan et al., 2017). Nonetheless, (asymptotic) convergence rates can be misleading as they do not reflect the practical number of iterations required to converge. Furthermore, as is clear for the GMM, there are additional implicit constraints that are not necessarily satisfied by (4) (i.e., $\alpha_{1}+\ldots+\alpha_{K}=1$ and $\Sigma_{i}\succ 0$ for $i=1,\ldots,K$ ).

That said, we need to further understand the transient and the local behavior of the GEM algorithm, for which dynamical systems theory provides us with the proper framework. Subsequently, in this paper, we propose to step away from the dynamics without an explicit control (e.g., the M-step in Algorithm 1), towards one where we can consider an additive control, and therefore, study its properties.

In summary, we seek to address the following questions.

Problem 1.

1. Is it possible to replace the M-step in Algorithm 1 by a parameter update step given by

$\theta^{(k+1)}=\theta^{(k)}+u^{(k)},$

where we can design a feedback control law $u^{(k)}=\phi(\theta^{(k)})$ to obtain a GEM algorithm?

2. What insights (particularly with respect to design) can such control laws provide us with?

3 Main Results

In this section, we provide the main result of the paper. Specifically, we show how we can leverage tools from control systems theory to analyze a GEM algorithm as an LTI system connected in feedback with a nonlinearity. Furthermore, we also show how to derive the convergence rate for such an algorithm using tools from robust control. Lastly, we briefly describe how we can look into certain aspects of designing new GEM-like algorithms.

3.1 GEM Algorithms as LTI Systems with a Feedback Nonlinearity

We first show how we can leverage tools from dynamical systems and control theory to cast a GEM algorithm into the framework of an LTI system with an interconnected feedback nonlinearity. We begin with the following Lemma that provides us with expressions for the closed-form solution of the problem of estimating the parameters of a GMM using a generalized EM algorithm.

Lemma 3.1 (Dempster et al. (1977)).

Given $K$ possible mixtures in the GMM, and independently and identically distributed (i.i.d.) samples $\{x^{(t)}\}_{t=1}^{N}$ , we can estimate the parameter vector $\theta$ by maximizing the log-likelihood $\mathcal{L}(\theta)$ , that, in the context of a GMM has a closed-form solution given as follows:

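$\alpha_{j}^{(k+1)}=\frac{1}{N}\sum_{t=1}^{N}h_{j}^{(k)}(t),$

$\mu_{j}^{(k+1)}=\frac{1}{\sum_{t=1}^{N}h_{j}^{(k)}(t)}\sum_{t=1}^{N}h_{j}^{(k)}(t)\,x^{(t)},$

and

$\Sigma_{j}^{(k+1)}=\frac{1}{\sum_{t=1}^{N}h_{j}^{(k)}(t)}\sum_{t=1}^{N}h_{j}^{(k)}(t)\,z_{j}^{(t),(k+1)}\left(z_{j}^{(t),(k+1)}\right)^{\mathsf{T}},$

with

$z_{j}^{(t),(k+1)}=x^{(t)}-\mu_{j}^{(k+1)},$

where the posterior probabilities $h_{j}^{(k)}$ are given by

$h_{j}^{(k)}(t)=\frac{\alpha_{j}^{(k)}\,p(x^{(t)}\mid\mu_{j}^{(k)},\Sigma_{j}^{(k)})}{\sum_{i=1}^{K}\alpha_{i}^{(k)}\,p(x^{(t)}\mid\mu_{i}^{(k)},\Sigma_{i}^{(k)})}.$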
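The closed-form updates of Lemma 3.1 can be sketched for the scalar case $d=1$, where each covariance reduces to a variance. This is a minimal illustration under that assumption; the function names and the default iteration count are hypothetical.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a scalar Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def e_step(xs, alphas, mus, variances):
    """Posterior probabilities h[t][j] at the current parameters."""
    h = []
    for x in xs:
        w = [a * normal_pdf(x, m, v) for a, m, v in zip(alphas, mus, variances)]
        total = sum(w)
        h.append([wj / total for wj in w])
    return h


def m_step(xs, h):
    """Closed-form updates of the weights, means and variances."""
    N, K = len(xs), len(h[0])
    alphas, mus, variances = [], [], []
    for j in range(K):
        nj = sum(h[t][j] for t in range(N))
        mu = sum(h[t][j] * xs[t] for t in range(N)) / nj
        # standard update: deviations are taken about the new mean
        var = sum(h[t][j] * (xs[t] - mu) ** 2 for t in range(N)) / nj
        alphas.append(nj / N)
        mus.append(mu)
        variances.append(var)
    return alphas, mus, variances


def log_likelihood(xs, alphas, mus, variances):
    return sum(
        math.log(sum(a * normal_pdf(x, m, v) for a, m, v in zip(alphas, mus, variances)))
        for x in xs
    )


def em(xs, alphas, mus, variances, iterations=25):
    """Run a fixed number of EM iterations and record the log-likelihood."""
    history = [log_likelihood(xs, alphas, mus, variances)]
    for _ in range(iterations):
        h = e_step(xs, alphas, mus, variances)
        alphas, mus, variances = m_step(xs, h)
        history.append(log_likelihood(xs, alphas, mus, variances))
    return alphas, mus, variances, history
```

The recorded log-likelihood values should be nondecreasing, which is the defining monotonicity property of EM. The sketch has no safeguard against a variance collapsing to zero, which can happen on degenerate data.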

With the above closed-form solution, if we, instead, consider a ‘shifted-update’ of the covariance as

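$\Sigma_{j}^{(k+1)}=\frac{1}{\sum_{t=1}^{N}h_{j}^{(k)}(t)}\sum_{t=1}^{N}h_{j}^{(k)}(t)\,z_{j}^{(t),(k)}\left(z_{j}^{(t),(k)}\right)^{\mathsf{T}},$ (13)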

(i.e., the update of $\Sigma_{j}^{(k+1)}$ is done with respect to $\mu_{j}^{(k)}$ instead of $\mu_{j}^{(k+1)}$ ), we can summarize in the following Lemma the relationships between the updates of the parameters of the GMM that we aim to estimate, i.e., the mixing weights, the means, and the covariance matrices.

Lemma 3.2.

For the shifted updates of the covariance matrices considered in (13), the following relations hold:

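$\alpha^{(k+1)}-\alpha^{(k)}=P_{\alpha^{(k)}}\left.\frac{\partial\mathcal{L}}{\partial\alpha}\right|_{\alpha=\alpha^{(k)}},$ (14)

$\mu_{j}^{(k+1)}-\mu_{j}^{(k)}=P_{\mu_{j}^{(k)}}\left.\frac{\partial\mathcal{L}}{\partial\mu_{j}}\right|_{\mu_{j}=\mu_{j}^{(k)}},$ (15)

and

$\operatorname{vec}[\Sigma_{j}^{(k+1)}]-\operatorname{vec}[\Sigma_{j}^{(k)}]=P_{\Sigma_{j}^{(k)}}\left.\frac{\partial\mathcal{L}}{\partial\operatorname{vec}[\Sigma_{j}]}\right|_{\Sigma_{j}=\Sigma_{j}^{(k)}},$ (16)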

with

[TABLE]

where $j\in\{1,\ldots,K\}$ denotes the indices of the mixture components, $k$ denotes the iteration number, and $\otimes$ denotes the Kronecker product.
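The relations of Lemma 3.2 can be checked numerically in the scalar case. The explicit scaling matrices are not reproduced in this text, so the sketch below assumes the forms given by Xu and Jordan (1996), specialised to $d=1$: $P_{\alpha}=\frac{1}{N}(\operatorname{diag}(\alpha)-\alpha\alpha^{\mathsf{T}})$, $P_{\mu_{j}}=\sigma_{j}^{2}/\sum_{t}h_{j}(t)$, and $P_{\Sigma_{j}}=2\sigma_{j}^{4}/\sum_{t}h_{j}(t)$ (the Kronecker product reduces to an ordinary product). It also takes $\mathcal{L}$ to be the log-likelihood. The function name is hypothetical.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a scalar Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_increments_and_scaled_gradients(xs, alphas, mus, variances):
    """Return (EM increment, P * gradient) pairs for alpha, mu and variance."""
    N, K = len(xs), len(alphas)
    h = []
    for x in xs:
        w = [alphas[j] * normal_pdf(x, mus[j], variances[j]) for j in range(K)]
        total = sum(w)
        h.append([wj / total for wj in w])
    n = [sum(h[t][j] for t in range(N)) for j in range(K)]

    # EM increments; the variance uses the OLD mean (the shifted update).
    d_alpha = [n[j] / N - alphas[j] for j in range(K)]
    d_mu = [sum(h[t][j] * xs[t] for t in range(N)) / n[j] - mus[j] for j in range(K)]
    d_var = [
        sum(h[t][j] * (xs[t] - mus[j]) ** 2 for t in range(N)) / n[j] - variances[j]
        for j in range(K)
    ]

    # Gradients of the log-likelihood at the current parameters.
    g_alpha = [n[j] / alphas[j] for j in range(K)]
    g_mu = [sum(h[t][j] * (xs[t] - mus[j]) for t in range(N)) / variances[j]
            for j in range(K)]
    g_var = [
        sum(h[t][j] * ((xs[t] - mus[j]) ** 2 - variances[j]) for t in range(N))
        / (2.0 * variances[j] ** 2)
        for j in range(K)
    ]

    # Assumed scaling matrices (Xu and Jordan, 1996), specialised to d = 1.
    s = sum(alphas[i] * g_alpha[i] for i in range(K))
    p_alpha = [(alphas[j] * g_alpha[j] - alphas[j] * s) / N for j in range(K)]
    p_mu = [variances[j] / n[j] * g_mu[j] for j in range(K)]
    p_var = [2.0 * variances[j] ** 2 / n[j] * g_var[j] for j in range(K)]
    return (d_alpha, p_alpha), (d_mu, p_mu), (d_var, p_var)
```

Under these assumptions each pair agrees up to rounding. The variance relation holds only because the deviations are taken about the old mean, which is one way to see why the shifted update is the natural one for this representation.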

Therefore, by combining the equations (14)-(16) of Lemma 3.2, we can briefly write the evolution of the parameters as

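$\theta^{(k+1)}=\theta^{(k)}+P(\theta^{(k)})\nabla\mathcal{L}(\theta^{(k)}),$

where

$P(\theta)=\operatorname{diag}\left[P_{\alpha},P_{\mu_{1}},\dots,P_{\mu_{K}},P_{\Sigma_{1}},\dots,P_{\Sigma_{K}}\right].$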

In this case, the term $\phi(\theta^{(k)})=P\left(\theta^{(k)}\right)\nabla\mathcal{L}(\theta^{(k)})$ could be understood as a nonlinearity driving the system. Nonetheless, it is not guaranteed that some essential implicit constraints hold, i.e.,

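$\theta^{(k+1)}\in\Theta=\left\{\theta:\sum_{j=1}^{K}\alpha_{j}=1,\ \Sigma_{j}=\Sigma_{j}^{\mathsf{T}}\succ 0\right\}.$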

Therefore, towards incorporating these inter-dependencies, we can consider the ‘shifted’ subspace

$\Theta_{s}=\left\{\theta^{\prime}:\sum_{j=1}^{K}\alpha_{j}=0\right\},$

where $\theta^{\prime}=\theta-\theta_{0}\in\Theta$ for a shift $\theta_{0}$ . Furthermore, let the coordinates of $\theta^{\prime}$ under the basis $\{e_{1},\ldots,e_{m}\}$ be denoted by $\theta_{c}$ , where the $e_{i}$ -s are canonical (orthonormal) basis vectors and $m$ is the dimension of $\Theta$ . Then, $\theta-\theta_{0}=E\theta_{c}$ , or equivalently, $\theta=E\theta_{c}+\theta_{0}$ , where $E=[e_{1},\ldots,e_{m}]$ .

Now, notice that $E^{\mathsf{T}}\theta=E^{\mathsf{T}}E\theta_{c}+E^{\mathsf{T}}\theta_{0}$ (or, equivalently, $\theta_{c}=E^{\mathsf{T}}\theta-E^{\mathsf{T}}\theta_{0}$ ), by multiplying both sides by $E^{\mathsf{T}}$ , and noticing that $E^{\mathsf{T}}E=I$ . Thus, $\theta^{\prime}=\theta-\theta_{0}=E^{\mathsf{T}}E\left(\theta-\theta_{0}\right)=EE^{\mathsf{T}}\theta^{\prime}\in\Theta$ , by observing that $\Theta$ is an open convex set since we only consider local differential properties of the log-likelihood, and consequently, the constraint on positive definiteness of $\Sigma_{j}$ holds.

Therefore,

$\theta^{(k+1)}=\theta^{(k)}+EE^{\mathsf{T}}P(\theta^{(k)})\nabla\mathcal{L}(\theta^{(k)})$

belongs to $\Theta$ , which constitutes the parameter update of a GEM algorithm that we shall refer to as projection-based GEM (PB-GEM) – see Algorithm 2.
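The effect of the projection $EE^{\mathsf{T}}$ on the mixing weights can be sketched as follows. The sketch assumes that the columns of $E$ form an orthonormal basis of the shifted subspace, so that on the $\alpha$ block $EE^{\mathsf{T}}$ acts as the orthogonal projector $I-\frac{1}{K}\mathbf{1}\mathbf{1}^{\mathsf{T}}$ onto vectors whose entries sum to zero. The function names are hypothetical.

```python
def project_alpha_increment(delta):
    """Orthogonal projection onto {delta : sum(delta) = 0}."""
    mean = sum(delta) / len(delta)
    return [d - mean for d in delta]


def pb_step_alpha(alphas, delta):
    """Add a projected increment so that the weights still sum to one."""
    projected = project_alpha_increment(delta)
    return [a + p for a, p in zip(alphas, projected)]


# Hypothetical weights and an increment that does not sum to zero
alphas = [0.2, 0.3, 0.5]
delta = [0.10, 0.05, 0.00]
new_alphas = pb_step_alpha(alphas, delta)
print(new_alphas, sum(new_alphas))
```

The projection preserves the sum constraint only; it does not by itself keep each weight in $[0,1]$, so a large increment could still leave the feasible set.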

Consequently, the term $\phi(\theta^{(k)})=EE^{\mathsf{T}}\left.P\left(\theta^{(k)}\right)\frac{\partial\mathcal{L}}{\partial\theta}\right|_{\theta=\theta^{(k)}}$ can be understood as a nonlinearity driving a linear time-invariant (LTI) system. As such, we can consider $\phi(\theta)=\nabla f(\theta)$ to be (locally) Lipschitz and for which there is a (locally) strongly convex function $f$ . Before stating the theorem that shows the rate of convergence for the PB-GEM algorithm, we introduce the following preliminary definitions and results.

Definition 3.3 (Q-convergence (Jay, 2001)).

Given a sequence $\{\theta^{(k)}\}\to\theta^{\star}$ with $\theta^{(k)}\neq\theta^{\star}$ for $k=0,1,2,\ldots$ , the order of convergence $\beta$ is a nonnegative number satisfying

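$\limsup_{k\to\infty}\frac{\|\theta^{(k+1)}-\theta^{\star}\|}{\|\theta^{(k)}-\theta^{\star}\|^{\beta}}=\rho<\infty,$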

with $\rho$ being the rate of convergence.
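Definition 3.3 can be illustrated on scalar sequences with a known limit. The two sequences below are hypothetical examples: one converges linearly ($\beta=1$) and one quadratically ($\beta=2$). The helper name `q_ratios` is also an assumption.

```python
def q_ratios(sequence, limit, beta=1.0):
    """Ratios |x_{k+1} - x*| / |x_k - x*|**beta along a scalar sequence."""
    errors = [abs(x - limit) for x in sequence]
    return [errors[k + 1] / errors[k] ** beta for k in range(len(errors) - 1)]


# x_k = 0.5**k halves the error at every step: order 1, rate 0.5
linear = [0.5 ** k for k in range(1, 12)]

# x_{k+1} = x_k**2 squares the error at every step: order 2, rate 1
quadratic = [0.5 ** (2 ** k) for k in range(1, 6)]

print(q_ratios(linear, 0.0, beta=1.0))
print(q_ratios(quadratic, 0.0, beta=2.0))
```

In practice $\theta^{\star}$ is unknown, so such ratios are usually computed against the final iterate or against successive differences, which gives only an estimate of $\rho$.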

Definition 3.4 (Sector Integral Quadratic Constraint (IQC) for the gradient map).

For a strongly convex function $f$ with strong convexity parameter $\mu_{f}$ , having Lipschitz continuous gradients with Lipschitz constant $L$ , the gradient map $\nabla f$ satisfies the sector IQC defined by

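$\begin{bmatrix}\theta-\theta^{\star}\\ \nabla f(\theta)-\nabla f(\theta^{\star})\end{bmatrix}^{\mathsf{T}}\begin{bmatrix}-2\mu_{f}LI & (L+\mu_{f})I\\ (L+\mu_{f})I & -2I\end{bmatrix}\begin{bmatrix}\theta-\theta^{\star}\\ \nabla f(\theta)-\nabla f(\theta^{\star})\end{bmatrix}\geq 0$ (26)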

for all $\theta,\theta^{\star}$ .
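As a quick sanity check of the sector IQC, consider the scalar quadratic $f(\theta)=\frac{a}{2}(\theta-\theta^{\star})^{2}$ with $\mu_{f}\leq a\leq L$. This test function is an illustrative assumption and not an example from the paper. Writing $e=\theta-\theta^{\star}$, the quadratic form factors into a product that is visibly nonnegative:

```latex
\begin{aligned}
\nabla f(\theta)-\nabla f(\theta^{\star}) &= a e,\\
\begin{bmatrix} e\\ a e\end{bmatrix}^{\mathsf{T}}
\begin{bmatrix}-2\mu_{f}L & L+\mu_{f}\\ L+\mu_{f} & -2\end{bmatrix}
\begin{bmatrix} e\\ a e\end{bmatrix}
&= \left(-2\mu_{f}L+2a(L+\mu_{f})-2a^{2}\right)e^{2}\\
&= 2\,(a-\mu_{f})(L-a)\,e^{2}\;\geq\;0 .
\end{aligned}
```

Equality holds when $a=\mu_{f}$ or $a=L$, so for quadratics the sector bounds are tight.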

Lemma 3.5 (A modified version of Theorem 4 in Lessard et al. (2016)).

Consider a first-order linear optimization scheme represented as the dynamical system

$\xi[k+1]=A\xi[k]+Bu[k],\quad\theta[k]=C\xi[k]+Du[k],\quad u[k]=\phi(\theta[k]),$ (27)

with nonlinearity $\phi(\theta)=\nabla f(\theta)$ . If $\nabla f$ satisfies the sector IQC defined by (26), then the linear matrix inequality (LMI)

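$\begin{bmatrix}A^{\mathsf{T}}\\ B^{\mathsf{T}}\end{bmatrix}R\begin{bmatrix}A^{\mathsf{T}}\\ B^{\mathsf{T}}\end{bmatrix}^{\mathsf{T}}-\begin{bmatrix}\rho^{2}R & 0\\ 0 & 0\end{bmatrix}+\lambda\begin{bmatrix}C & D\\ 0 & I\end{bmatrix}^{\mathsf{T}}\begin{bmatrix}-2\mu_{f}LI & (L+\mu_{f})I\\ (L+\mu_{f})I & -2I\end{bmatrix}\begin{bmatrix}C & D\\ 0 & I\end{bmatrix}\preceq 0$ (28)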

is feasible for some $R\succ 0$ , $\lambda\geq 0$ . Specifically, $\{\xi[k]\}\to\xi^{\star}$ with respect to a suitable norm $\|\cdot\|$ , with a convergence rate of $\rho$ , where $\xi^{\star}$ is a fixed point of (27) satisfying $\xi^{\star}=A\xi^{\star}$ .

With the above ingredients, we are ready to state our main result concerning the convergence rate of the PB-GEM algorithm, which builds upon tools from robust control theory.

Theorem 3.6.

Consider a function $f(\theta)$ that is $\mu_{f}$ -strongly convex, has an $L$ -Lipschitz gradient, and satisfies $\nabla f(\theta)=EE^{\mathsf{T}}P(\theta)\nabla\mathcal{L}(\theta)$ . Then, $\theta^{(k+1)}=\theta^{(k)}+u^{(k)}$ , with $u^{(k)}=\nabla f(\theta^{(k)})$ is a GEM algorithm (i.e., $\{\theta^{(k)}\}\to\theta^{\star}$ , where $\theta^{\star}$ is the maximum-likelihood estimate) with convergence rate $\rho$ bounded by

$\rho\leq\max\{|1-\mu_{f}|,|1-L|\}.$ (29)

Proof.

That the Projection-Based GEM algorithm presented in Algorithm 2 indeed constitutes a generalized EM can be shown using an argument similar to one presented in Salakhutdinov et al. (2003). In particular, if Assumption 3 is satisfied for models of the exponential family (a special case being the GMMs considered in this paper), the PB-GEM algorithm evolves in a way such that we have $\mathcal{L}(\theta^{(k+1)})>\mathcal{L}(\theta^{(k)})$ for all $k\in\mathbb{Z}_{+}$ , provided $\nabla\mathcal{L}(\theta^{(0)})\neq 0$ . Secondly, we notice that the iterative scheme can be represented as the LTI system in (27) with a feedback nonlinearity given by $\phi(\cdot)=\nabla f(\cdot)$ .

Due to the regularity of $f$ , and the fact that the PB-GEM algorithm can be represented as the dynamical system (27), with $A=B=C=I$ and $D=0$ , we can invoke the results of Definition 3.4 and Lemma 3.5 to recover bounds on the convergence rate of the PB-GEM algorithm using the LMI in (28). Remarkably, due to the general block-diagonal structure of optimization algorithms like gradient ascent, we can then use a ‘lossless dimensionality reduction argument’ and reduce the case of the feasibility of the above LMI to analyze the corresponding semidefinite program for the single-dimensional case without loss of generality (Lessard et al., 2016).

This ascertains the local convergence for the maximum of the function $f$ as long as the following LMI holds

[TABLE]

for some scalar $R>0$ , and $\lambda\geq 0$ , where $\rho\in(0,1)$ denotes the convergence rate. Since $R$ is a scalar, we can consider $R=1$ without loss of generality. This gives us the LMI

[TABLE]

As a consequence, to ensure the negative semidefiniteness of the above matrix, both $1-2\lambda$ (which is present in the bottom right block) and the Schur complement of the bottom right block need to be negative semidefinite. Thus, we have

[TABLE]

Combining these two, we have

[TABLE]

which yields $\rho\leq\max\{|1-\mu_{f}|,|1-L|\}$ . ∎
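The bound $\rho\leq\max\{|1-\mu_{f}|,|1-L|\}$ can be illustrated on scalar quadratics $f(\theta)=\frac{a}{2}\theta^{2}$ with curvature $a\in[\mu_{f},L]$. The sketch below assumes the sign convention of Lessard et al. (2016), in which the unit step is taken against the gradient of the strongly convex $f$; this convention, the test functions, and the function names are assumptions made for illustration.

```python
def rate_bound(mu_f, lipschitz):
    """Upper bound on the convergence rate from the theorem."""
    return max(abs(1.0 - mu_f), abs(1.0 - lipschitz))


def observed_rate(curvature, theta0=1.0, steps=20):
    """Largest per-step contraction of theta <- theta - f'(theta)
    for f(theta) = curvature / 2 * theta**2 (minimiser at 0).
    The curvature must differ from 1 so that no iterate is exactly zero."""
    theta = theta0
    ratios = []
    for _ in range(steps):
        new_theta = theta - curvature * theta
        ratios.append(abs(new_theta) / abs(theta))
        theta = new_theta
    return max(ratios)


mu_f, lipschitz = 0.5, 1.5
bound = rate_bound(mu_f, lipschitz)
for a in (0.5, 0.8, 1.4):
    print(a, observed_rate(a), "<=", bound)
```

For a quadratic the contraction factor is exactly $|1-a|$, so the bound is attained at the endpoints $a=\mu_{f}$ and $a=L$ and is strictly conservative in between.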

Additionally, the transformation matrix $P(\cdot)$ also provides us with valuable insights regarding the rate of convergence of the PB-GEM algorithm. Indeed, differentiating the equation

[TABLE]

we have,

[TABLE]

where $\frac{\partial P}{\partial\theta}=\begin{bmatrix}\frac{\partial P}{\partial\theta^{1}}&\ldots&\frac{\partial P}{\partial\theta^{m}}\end{bmatrix}$ with $\theta=\left[\theta^{1},\ldots,\theta^{m}\right]$ ,

[TABLE]

and $S(\theta)=\nabla^{2}\mathcal{L}(\theta)$ denotes the Hessian matrix of $\mathcal{L}(\cdot)$ .

Therefore, near a stationary point $\theta=\theta^{\star}$ of $\mathcal{L}(\theta)$ over which $P(\theta)$ is bounded, we have

[TABLE]

As a consequence, it follows that, under the aforementioned conditions, the projection-based GEM algorithm exhibits superlinear convergence when $\nabla\mathcal{L}(\theta)$ approaches zero. In particular, the nature of convergence is dictated by the eigenvalues of the matrix $\frac{\partial F^{\mathrm{PB-GEM}}}{\partial\theta}(\theta)$ . If the eigenvalues are near zero, then the transformation matrix scales the EM update step by approximately the scaled negative inverse Hessian, and the EM algorithm behaves like Newton’s method. On the other hand, if the eigenvalues are near unity (in absolute value), then the PB-GEM algorithm exhibits first-order convergence.

3.2 Towards the Design of GEM Algorithms

We can, therefore, propose to design a GEM algorithm by changing the control law. Nonetheless, we have to be careful with the updates on the different parameters as, implicitly, they possess constraints on the updates. Specifically, we require the $\alpha$ -s to sum up to unity, and the $\Sigma$ -s to be symmetric positive definite.

Subsequently, in what follows, we focus only on the change of the mean by considering the following weighted function $f_{W}(\theta)$ such that $f_{W}(\theta)$ satisfies

$\nabla f_{W}(\theta)=EE^{\mathsf{T}}DP(\theta)\nabla\mathcal{L}(\theta),$

where $D=\text{diag}(I_{K},W,I_{d^{2}K})$ , and with $W\in\mathbb{R}^{dK\times dK}$ being a weight matrix that mixes the different means. In particular, we can consider $W=\text{diag}(\beta_{1}I_{d},\ldots,\beta_{K}I_{d})$ where $\beta_{i}>0$ denotes a scaling of the mean similar to a learning rate but applied only on the component rates of the means of the mixture model. Note that we can extend this design step only on the means because the means are the only parameters of the GMMs under consideration that do not have implicit constraints associated with them. This allows us to introduce the following parameter update step

$\theta^{(k+1)}=\theta^{(k)}+EE^{\mathsf{T}}DP(\theta^{(k)})\nabla\mathcal{L}(\theta^{(k)})$

for an algorithm which we will refer to as the weighted projection-based GEM (W-PB-GEM) algorithm – see Algorithm 3.
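The weighted update of the means can be sketched in the scalar case. The sketch assumes that the unweighted increment of each mean equals the EM increment (as in Lemma 3.2), that $W=\operatorname{diag}(\beta_{1},\ldots,\beta_{K})$ for $d=1$, and that the projection leaves the unconstrained mean block unchanged. The function names and the parameter `betas` are hypothetical.

```python
import math


def normal_pdf(x, mu, var):
    """Density of a scalar Gaussian with mean mu and variance var."""
    return math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def em_means(xs, alphas, mus, variances):
    """Means produced by one standard EM update."""
    K = len(alphas)
    num, den = [0.0] * K, [0.0] * K
    for x in xs:
        w = [alphas[j] * normal_pdf(x, mus[j], variances[j]) for j in range(K)]
        total = sum(w)
        for j in range(K):
            hj = w[j] / total
            num[j] += hj * x
            den[j] += hj
    return [num[j] / den[j] for j in range(K)]


def weighted_mean_step(xs, alphas, mus, variances, betas):
    """mu_j <- mu_j + beta_j * (EM mean_j - mu_j)."""
    targets = em_means(xs, alphas, mus, variances)
    return [mu + beta * (target - mu) for mu, beta, target in zip(mus, betas, targets)]
```

Setting every $\beta_{j}=1$ recovers the unweighted mean update, while $\beta_{j}>1$ takes a longer step along the same direction for component $j$.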

As a result, we have the following corollary on the convergence rate of the W-PB-GEM algorithm.

Corollary 3.7.

Suppose that there exists a function $f_{W}(\theta)$ that is $\mu_{f}$ -strongly convex, has an $L$ -Lipschitz gradient, and satisfies $\nabla f_{W}(\theta)=EE^{\mathsf{T}}DP(\theta)\nabla\mathcal{L}(\theta)$ , where $D=\text{diag}(I_{K},W,I_{d^{2}K})$ , and with $W\in\mathbb{R}^{dK\times dK}$ being the matrix of weights that determine the component-wise mixture of the means of the GMM whose parameters are to be estimated. Then, $\theta^{(k+1)}=\theta^{(k)}+u^{(k)}$ , with $u^{(k)}=\nabla f_{W}(\theta^{(k)})$ is a GEM algorithm (i.e., $\{\theta^{(k)}\}\to\theta^{\star}$ , where $\theta^{\star}$ is the maximum-likelihood estimate) with convergence rate $\rho$ bounded by

$\rho\leq\max\{|1-\mu_{f}|,|1-L|\}.$ (40)

Remark 1.

The convergence rates obtained for the W-PB-GEM algorithm are the same as those obtained for the PB-GEM algorithm. It is to be noted, however, that the update equations associated with the $\alpha$ -s and the $\Sigma$ -s cannot be arbitrarily changed because of the explicit constraints associated with them.

Remark 2.

It is worth mentioning here that the convergence rates as obtained in (29) and (40) are merely upper bounds, and, unfortunately, do not shed any light on the transient behavior of the PB-GEM or W-PB-GEM algorithm – see the inset of Figures 2 and 4.

4 Pedagogical Examples

In this section, we present a pedagogical example that shows the efficacy of the methods proposed in this paper in identifying the parameters of unknown GMMs. To do this, we first sample $1000$ points from a mixture of two Gaussians with the following parameters

[TABLE]

and

[TABLE]

Further, we initialized the algorithms with the following parameters

[TABLE]

such that they lie on the line which is orthogonal to the direction defined by $\mu_{1}^{\star}$ and $\mu_{2}^{\star}$ . Additionally, $\Sigma_{1}$ and $\Sigma_{2}$ are initialized to be arbitrary positive definite matrices and $\alpha$ is arbitrarily initialized such that $\alpha_{1}+\alpha_{2}=1$ .

4.1 The PB-GEM algorithm

We first consider a pedagogical example to demonstrate the performance of the proposed PB-GEM algorithm on estimating the parameters of the synthetic Gaussian mixture model specified above. The results of running the PB-GEM algorithm to determine the parameters of the above mixture model are shown in Figures 1 and 2. We see that the proposed PB-GEM algorithm is able to successfully determine the parameters of the synthetic GMM from which the points have been sampled. Convergence is obtained in 316 iterations (i.e., to attain a relative change in log-likelihood smaller than $10^{-10}$ ).
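The stopping rule used above can be sketched as a generic iteration loop. The exact normalisation of the relative change is not specified in the text, so the sketch assumes $|\ell_{\text{new}}-\ell_{\text{old}}|/|\ell_{\text{old}}|$. The function names, the `max_iter` cap, and the toy objective are hypothetical.

```python
def relative_change(old, new):
    """Relative change between two successive log-likelihood values."""
    return abs(new - old) / max(abs(old), 1e-300)


def iterate_until_converged(update, log_likelihood, theta, tol=1e-10, max_iter=10_000):
    """Apply `update` until the relative change in log-likelihood drops below tol.

    Returns the final parameter and the number of iterations performed."""
    ll_old = log_likelihood(theta)
    for k in range(1, max_iter + 1):
        theta = update(theta)
        ll_new = log_likelihood(theta)
        if relative_change(ll_old, ll_new) < tol:
            return theta, k
        ll_old = ll_new
    return theta, max_iter


# Toy stand-in for an EM-type iteration: maximise -(theta - 3)**2 - 1
theta, iterations = iterate_until_converged(
    update=lambda th: th - 0.5 * (th - 3.0),
    log_likelihood=lambda th: -(th - 3.0) ** 2 - 1.0,
    theta=0.0,
)
print(theta, iterations)
```

A criterion based on the log-likelihood alone measures progress in the objective, not in the parameters, so a flat region of the likelihood can trigger it before the estimates have settled.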

4.2 The W-PB-GEM algorithm

Next, we test the performance of the W-PB-GEM algorithm. The matrix of weights $W$ that determines the mixture of proportions during the updates of the means is given by

[TABLE]

The results of running the W-PB-GEM algorithm with the same initializations for $\mu_{1},\mu_{2},\Sigma_{1},\Sigma_{2}$ , and $\alpha$ are documented in Figures 3 and 4. Convergence is obtained in 279 iterations with the same stopping criterion used in the previous section.

4.3 Multi-Class Classification

In what follows, we also present an illustrative example where both the PB-GEM and the W-PB-GEM algorithms are used in order to identify the parameters of a GMM with more than two Gaussians. We sample $1000$ arbitrary points from a mixture of four Gaussians with the following parameters

[TABLE]

and

[TABLE]

Further, we initialized both the algorithms with the following parameters

[TABLE]

Additionally, $\Sigma_{1},\Sigma_{2},\Sigma_{3},$ and $\Sigma_{4}$ are initialized to be arbitrary positive definite matrices and $\alpha$ is arbitrarily initialized such that $\sum_{i=1}^{4}\alpha_{i}=1$ . The matrix of weights $W$ that determines the mixture of proportions during the updates of the means for the W-PB-GEM algorithm is once again given by

[TABLE]

The results of running the PB-GEM and W-PB-GEM algorithms for this problem are shown in Figures 5 and 6, respectively. The results are similar to those of the two-class case. Convergence (i.e., attaining a relative change in log-likelihood smaller than $10^{-10}$ ) is obtained in $1822$ iterations for the PB-GEM algorithm and in $472$ iterations for the W-PB-GEM algorithm.

4.4 Discussion of results

The reason why the initial conditions on the means are selected such that they lie on a line orthogonal to the means characterizing the synthetic GMM considered above is to intentionally make the convergence of the PB-GEM and W-PB-GEM algorithms more difficult. We also illustrate in Figure 7 a comparative study of the PB-GEM and W-PB-GEM algorithms for the two-class example (the matrix $W$ was selected identical to the one in Section 4.2) by plotting the mean and standard deviation of the negative log-likelihood function for 30 instances of both the algorithms when they are initialized with the same set of parameters for a particular instance. In general, such worst-case initialization conditions are useful in order to gain insights into the transient behaviors of such algorithms.

It is also instructive to note that, for the problem of identifying the parameters of a high-dimensional GMM, the number of iterations to convergence would grow exponentially. In such a case, it is extremely important to converge to the actual parameters in as few iterations as possible: each iteration involves a pass over the entire dataset, so when the dataset is large, fewer iterations to convergence translate directly into less time spent estimating the parameters.
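To make the per-iteration cost concrete, the sketch below evaluates the GMM log-likelihood, which touches every data point once per call. It uses the standard Gaussian mixture density and is not taken from the paper; the function name and argument layout are assumptions for illustration.

```python
import numpy as np

def gmm_log_likelihood(data, alpha, means, covs):
    """Log-likelihood of data (n x d) under a GMM; one full pass over the data."""
    n, d = data.shape
    log_terms = np.empty((n, len(alpha)))
    for k in range(len(alpha)):
        chol = np.linalg.cholesky(covs[k])                 # covs[k] = chol @ chol.T
        z = np.linalg.solve(chol, (data - means[k]).T)     # shape (d, n)
        maha = np.sum(z ** 2, axis=0)                      # squared Mahalanobis distances
        log_det = 2.0 * np.sum(np.log(np.diag(chol)))
        log_terms[:, k] = np.log(alpha[k]) - 0.5 * (d * np.log(2.0 * np.pi) + log_det + maha)
    # Log-sum-exp over components, stabilized by subtracting the row maximum.
    peak = log_terms.max(axis=1, keepdims=True)
    return float(np.sum(peak[:, 0] + np.log(np.exp(log_terms - peak).sum(axis=1))))

# Sanity check on a single standard normal component evaluated at the origin.
ll_single = gmm_log_likelihood(
    np.array([[0.0]]), np.array([1.0]), np.array([[0.0]]), np.array([[[1.0]]])
)
```

The work in this routine grows linearly with the number of data points, so the total running time of an iterative scheme is roughly the iteration count multiplied by this per-pass cost.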

We also reiterate that the convergence rates presented in (29) and (40) are merely upper bounds. As demonstrated in the insets of Figures 2 and 4, these bounds have no relationship with the transient behaviors of the PB-GEM and W-PB-GEM algorithms. Although in practice we can improve the convergence of these algorithms by designing new and more efficient variants (as detailed in the construction of the W-PB-GEM algorithm), the upper bound on the convergence rate does not change.

5 Conclusions and Future Work

In this paper, we analyzed a GEM algorithm to estimate the parameters of GMMs from a dynamical systems perspective. In particular, we showed that this algorithm can be understood as an LTI system connected in feedback with a nonlinearity. The convergence properties of the proposed algorithm were studied by utilizing tools from robust control theory. We also explored the simple design of this class of GEM algorithms and provided evidence, using pedagogical examples, that it might be possible to improve the transient and the practical convergence of these algorithms despite the fact that they exhibit the same asymptotic convergence rates. Future work will consist of using tools from adaptive systems theory to accelerate practical convergence properties for GEM algorithms. Additionally, fundamental connections exist between the EM algorithm and proximal point methods (Chrétien and Hero, 2000, 2008; Figueiredo, 2008). Future work will therefore also focus on analyzing proximal interpretations of the EM algorithm using tools from robust control theory (Lessard et al., 2016; Fazlyab et al., 2018).

Bibliography

Balakrishnan, S., Wainwright, M. J., and Yu, B. (2017). Statistical guarantees for the EM algorithm: From population to sample-based analysis. The Annals of Statistics, 45(1), 77–120.
Bishop, C. (2006). Pattern Recognition and Machine Learning. Springer Verlag.
Chrétien, S., and Hero, A. O. (2000). Kullback proximal algorithms for maximum-likelihood estimation. IEEE Transactions on Information Theory, 46(5), 1800–1810.
Chrétien, S., and Hero, A. O. (2008). On EM algorithms and their proximal generalizations. ESAIM: Probability and Statistics, 12, 308–326.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 1–38.
Fazlyab, M., Ribeiro, A., Morari, M., and Preciado, V. M. (2018). Analysis of optimization algorithms via integral quadratic constraints: Nonstrongly convex problems. SIAM Journal on Optimization, 28(3), 2654–2689.
Fessler, J. A., and Hero, A. O. (1995). Penalized maximum-likelihood image reconstruction using space-alternating generalized EM algorithms. IEEE Transactions on Image Processing, 4(10), 1417–1429.
