Deterministic Quantum Annealing Expectation-Maximization Algorithm

Hideyuki Miyahara, Koji Tsumura, and Yuki Sughiyama

arXiv:1704.05822·stat.ML·November 21, 2017

TL;DR

This paper introduces the DQAEM algorithm, a quantum annealing-based extension of EM, which improves maximum likelihood estimation by overcoming local optima issues, demonstrated through numerical simulations.

Contribution

The paper proposes the DQAEM algorithm, integrating quantum annealing with EM to enhance global optimization in maximum likelihood estimation.

Findings

1. DQAEM outperforms EM in numerical simulations.

2. DQAEM reduces dependence on initial configurations.

3. DQAEM effectively finds better optima in MLE tasks.

Abstract

Maximum likelihood estimation (MLE) is one of the most important methods in machine learning, and the expectation-maximization (EM) algorithm is often used to obtain maximum likelihood estimates. However, EM heavily depends on initial configurations and fails to find the global optimum. On the other hand, in the field of physics, quantum annealing (QA) was proposed as a novel optimization approach. Motivated by QA, we propose a quantum annealing extension of EM, which we call the deterministic quantum annealing expectation-maximization (DQAEM) algorithm. We also discuss its advantage in terms of the path integral formulation. Furthermore, by employing numerical simulations, we illustrate how it works in MLE and show that DQAEM outperforms EM.

Tables

Table 1: Success ratios of DQAEM, EM, and DSAEM.

DQAEM	EM	DSAEM
40.5 %	20.3 %	34.3 %

Equations

$\mathcal{K}(\theta)$

$\mathcal{L}(\theta)$

$\mathrm{KL}\coloneqq-\sum_{i=1}^{N}\sum_{\sigma_{i}\in S^{\sigma}}q(\sigma_{i})\ln\left(\frac{f(y_{i},\sigma_{i};\theta)}{\sum_{\sigma_{i}\in S^{\sigma}}f(y_{i},\sigma_{i};\theta)}\frac{1}{q(\sigma_{i})}\right).$

$q(\sigma_{i})$

$\mathcal{L}(\theta)=\mathcal{Q}(\theta,\theta^{\prime})-\sum_{i=1}^{N}\cdots$

$\mathcal{Q}(\theta,\theta^{\prime})$

$\theta_{t+1}=\operatorname*{arg\,max}_{\theta}\,\mathcal{Q}(\theta;\theta_{t}).$

$\hat{\sigma}_{i}\Ket{\sigma_{i}=k}$

$\Braket{\sigma_{i}=k}{\sigma_{i}=l}$

$\mathcal{G}_{\beta,\Gamma}(\theta)$

$\mathcal{Z}_{\beta,\Gamma}(\theta)$

$\mathcal{Z}_{\beta,\Gamma}^{(i)}(\theta)$

$f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta)$

$H(y_{i},\hat{\sigma}_{i};\theta)$

$H^{\mathrm{nc}}$

$\mathcal{G}_{\beta=1,\Gamma=0}(\theta)=\mathcal{K}(\theta).$

$\beta\mathcal{G}_{\beta,\Gamma}(\theta)$

$\mathcal{F}_{\beta,\Gamma}(\theta)$

$\mathrm{KL}\coloneqq-\sum_{i=1}^{N}\mathrm{Tr}_{\sigma_{i}}\bigg[\hat{\rho}_{i}\bigg\{\ln\frac{f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta)}{\mathrm{Tr}_{\sigma_{i}}\left[f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta)\right]}-\ln\hat{\rho}_{i}\bigg\}\bigg].$

$\hat{\rho}_{i}$

$\mathcal{U}_{\beta,\Gamma}(\theta;\theta^{\prime})\coloneqq\sum_{i=1}^{N}\mathrm{Tr}_{\sigma_{i}}\bigg[\frac{f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta^{\prime})}{\mathrm{Tr}_{\sigma_{i}}\left[f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta^{\prime})\right]}\ln f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta)\bigg].$

$\theta_{t+1}=\operatorname*{arg\,max}_{\theta}\,\mathcal{U}_{\beta,\Gamma}(\theta,\theta_{t}).$

$\mathcal{U}_{\beta,\Gamma=0}(\theta;\theta^{\prime})$

$\times\sum_{\sigma_{i}\in S^{\sigma}}\Braket{\sigma_{i}}{f_{\beta,\Gamma=0}(y_{i},\hat{\sigma}_{i};\theta^{\prime})\ln f_{\beta,\Gamma=0}(y_{i},\hat{\sigma}_{i};\theta)}{\sigma_{i}},$

$\sum_{i=1}^{N}\sum_{\sigma_{i}\in S^{\sigma}}\bigg\{\frac{f_{\beta,\Gamma=0}(y_{i},\sigma_{i};\theta_{t})}{\sum_{\sigma_{i}\in S^{\sigma}}f_{\beta,\Gamma=0}(y_{i},\sigma_{i};\theta_{t})}\frac{d}{d\theta}H(y_{i},\sigma_{i};\theta)\bigg\}$

$\sum_{i=1}^{N}\mathrm{Tr}_{\sigma_{i}}\bigg[\frac{f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta_{t})}{\mathrm{Tr}_{\sigma_{i}}\left[f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta_{t})\right]}\frac{d}{d\theta}H(y_{i},\hat{\sigma}_{i};\theta)\bigg]$

$\sum_{i=1}^{N}\sum_{\sigma_{i}\in S^{\sigma}}$

$\Braket{\sigma_{i}}{f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta^{\prime})}{\sigma_{i}}=\lim_{M\to\infty}\sum_{\sigma_{i,1}^{\prime},\sigma_{i,1},\dots,\sigma_{i,M-1},\sigma_{i,M}^{\prime}\in S^{\sigma}}\prod_{j=1}^{M}\Braket{\sigma_{i,j}}{e^{-\frac{\beta}{M}H(y_{i},\hat{\sigma}_{i};\theta^{\prime})}}{\sigma_{i,j}^{\prime}}\Braket{\sigma_{i,j}^{\prime}}{e^{-\frac{\beta}{M}H^{\mathrm{nc}}}}{\sigma_{i,j-1}},$

Full text

Deterministic Quantum Annealing Expectation-Maximization Algorithm

Hideyuki Miyahara

[email protected]

Graduate School of Information Science and Technology, The University of Tokyo, 7-3-1 Hongosanchome Bunkyo-ku Tokyo 113-8656, Japan

Koji Tsumura

Graduate School of Information Science and Technology, The University of Tokyo, 7-3-1 Hongosanchome Bunkyo-ku Tokyo 113-8656, Japan

Yuki Sughiyama

Institute of Industrial Science, The University of Tokyo, 4-6-1, Komaba, Meguro-ku, Tokyo 153-8505, Japan

pacs:

03.67.-a, 03.67.Ac, 89.90.+n, 89.70.-a, 89.20.-a

1 Introduction

Machine learning has attracted considerable attention in a wide range of fields [1, 2]. In unsupervised learning, which is a major branch of machine learning, maximum likelihood estimation (MLE) plays an important role in characterizing a given data set. One of the most common and practical approaches to MLE is the expectation-maximization (EM) algorithm. Although EM is widely used [3], it is also known to become trapped in local optima, depending on the initial configuration, because log-likelihood functions are generally non-convex.

One of the breakthroughs for non-convex optimization is simulated annealing (SA), proposed by Kirkpatrick et al. [4, 5]. In SA, a random variable, which mimics thermal fluctuations, is added during the optimization process to overcome potential barriers in non-convex optimization. Moreover, the global convergence of SA is guaranteed when the annealing process is infinitely long [6]. Motivated by SA and its variant [7, 8], Ueda and Nakano developed a deterministic simulated annealing expectation-maximization (DSAEM) algorithm by introducing thermal fluctuations into EM [9]. (This algorithm was originally called the deterministic annealing expectation-maximization algorithm in Ref. [9]; to distinguish their approach from ours, we refer to it as DSAEM in this paper.) This approach succeeded in improving the performance of EM without increasing the numerical cost. However, problems caused by non-convexity still remain.

Another approach to non-convex optimization is quantum annealing (QA), which was proposed in Refs. [10, 11, 12] and has been intensively studied by many physicists [13, 14, 15, 16, 17, 18, 19, 20, 21]. One reason why QA attracts great interest is that, in some cases, it outperforms SA [16]. Another reason is that QA can be directly implemented on a quantum computer, and much effort has been devoted to realizing one via many approaches [22, 23, 24, 25]. Moreover, quantum algorithms for machine learning have been extensively studied [26, 27, 28].

In this study, to improve the performance of EM and DSAEM, we employ QA and propose a new algorithm, which we call the deterministic quantum annealing expectation-maximization (DQAEM) algorithm. More precisely, by quantizing the hidden variables in EM and adding a non-commutative term, we extend the classical algorithms EM and DSAEM. We then discuss the mechanism of DQAEM from the viewpoint of the path integral formulation, and through this discussion we elucidate how DQAEM overcomes the problem of local optima. Furthermore, as an application, we focus on clustering problems with Gaussian mixture models (GMMs), and confirm through numerical simulations that DQAEM outperforms EM and DSAEM.

This paper is organized as follows. In Sec. 2, we review MLE and EM to prepare for DQAEM. In Sec. 3, which is the main section of this paper, we present the formulation of DQAEM. Next, in Sec. 4, we give an interpretation of DQAEM and show its advantage over EM and DSAEM from the viewpoint of the path integral formulation. Then, in Sec. 5, we introduce GMMs and present numerical simulations, in which DQAEM is found to be superior to EM and DSAEM. Finally, Sec. 6 concludes this paper.

2 Maximum likelihood estimation and expectation-maximization algorithm

To make this paper self-contained, we review MLE and EM. Suppose that we have $N$ data points $Y_{\mathrm{obs}}=\{y_{1},y_{2},\dots,y_{N}\}$ , which are independent and identically distributed, and let $\{\sigma_{1},\sigma_{2},\dots,\sigma_{N}\}$ be a set of hidden variables. In this paper, we denote the joint probability density function on $y_{i}$ and $\sigma_{i}$ with a parameter $\theta$ as $f(y_{i},\sigma_{i};\theta)$ .

Using these definitions, the log-likelihood function of $Y_{\mathrm{obs}}$ is represented by

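$$\mathcal{K}(\theta)\coloneqq\sum_{i=1}^{N}\ln\sum_{\sigma_{i}\in S^{\sigma}}f(y_{i},\sigma_{i};\theta),\tag{1}$$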

where $S^{\sigma}$ represents a discrete configuration set of $\sigma_{i}$; that is, $S^{\sigma}=\{1,2,\dots,K\}$. In MLE, we estimate the parameter $\theta$ by maximizing the log-likelihood function, Eq. (1). However, it is difficult to maximize Eq. (1) directly, because the maximizer of $\mathcal{K}(\theta)$ cannot be obtained analytically and primitive methods, such as Newton's method, are known to be less effective [29]. We therefore often use EM for practical applications [1, 2].

EM consists of two steps. To introduce them, we decompose $\mathcal{K}(\theta)$ into two parts:

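$$\mathcal{K}(\theta)=\mathcal{L}(\theta)+\mathrm{KL}\bigg(q(\sigma_{i})\bigg\|\frac{f(y_{i},\sigma_{i};\theta)}{\sum_{\sigma_{i}\in S^{\sigma}}f(y_{i},\sigma_{i};\theta)}\bigg),\tag{2}$$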

where

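$$\mathcal{L}(\theta)\coloneqq\sum_{i=1}^{N}\sum_{\sigma_{i}\in S^{\sigma}}q(\sigma_{i})\ln\frac{f(y_{i},\sigma_{i};\theta)}{q(\sigma_{i})},\tag{3}$$

$$\mathrm{KL}\bigg(q(\sigma_{i})\bigg\|\frac{f(y_{i},\sigma_{i};\theta)}{\sum_{\sigma_{i}\in S^{\sigma}}f(y_{i},\sigma_{i};\theta)}\bigg)\coloneqq-\sum_{i=1}^{N}\sum_{\sigma_{i}\in S^{\sigma}}q(\sigma_{i})\ln\bigg(\frac{f(y_{i},\sigma_{i};\theta)}{\sum_{\sigma_{i}\in S^{\sigma}}f(y_{i},\sigma_{i};\theta)}\frac{1}{q(\sigma_{i})}\bigg).\tag{4}$$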

Here, we use an arbitrary probability function $q(\sigma_{i})$ that satisfies $\sum_{\sigma_{i}\in S^{\sigma}}q(\sigma_{i})=1$. Note that $\mathrm{KL}(\cdot\|\cdot)$ is the Kullback-Leibler divergence [30, 31]. On the basis of this decomposition, EM is composed of the following two steps, which are called the E and M steps. In the E step, we minimize the KL divergence, Eq. (4), with respect to $q(\sigma_{i})$ under a fixed $\theta^{\prime}$. Then we obtain

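$$q(\sigma_{i})=\frac{f(y_{i},\sigma_{i};\theta^{\prime})}{\sum_{\sigma_{i}\in S^{\sigma}}f(y_{i},\sigma_{i};\theta^{\prime})}.\tag{5}$$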
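As a small numerical illustration of the decomposition of $\mathcal{K}(\theta)$ and of the E step, the following Python sketch evaluates the three quantities for a single data point. The numbers in `joint` and `q` are hypothetical, and the lower bound is assumed to take the standard form $\sum_{\sigma_{i}}q(\sigma_{i})\ln[f(y_{i},\sigma_{i};\theta)/q(\sigma_{i})]$; the KL term follows Eq. (4).

```python
import math

def decomposition(joint, q):
    """Return (K, L, KL) for one data point.

    joint[k] = f(y, sigma=k; theta) for a single observation y,
    q[k]     = an arbitrary distribution over the hidden variable.
    """
    evidence = sum(joint)                      # sum over sigma of f(y, sigma; theta)
    K = math.log(evidence)                     # log-likelihood contribution
    L = sum(qk * math.log(fk / qk) for fk, qk in zip(joint, q))
    KL = -sum(qk * math.log(fk / evidence / qk) for fk, qk in zip(joint, q))
    return K, L, KL

joint = [0.10, 0.30, 0.05]   # hypothetical values of f(y, sigma; theta)
q = [0.2, 0.5, 0.3]          # arbitrary q(sigma), sums to 1
K, L, KL = decomposition(joint, q)

# E step: choosing q as the posterior makes the KL term vanish, so L equals K.
posterior = [fk / sum(joint) for fk in joint]
K2, L2, KL2 = decomposition(joint, posterior)
```

For any admissible `q` the sum `L + KL` reproduces `K`, and the posterior choice is the one that drives `KL` to zero.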

Next, by substituting Eq. (5) into Eq. (3), we obtain

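$$\mathcal{L}(\theta)=\mathcal{Q}(\theta,\theta^{\prime})-\sum_{i=1}^{N}\sum_{\sigma_{i}\in S^{\sigma}}\frac{f(y_{i},\sigma_{i};\theta^{\prime})}{\sum_{\sigma_{i}\in S^{\sigma}}f(y_{i},\sigma_{i};\theta^{\prime})}\ln\frac{f(y_{i},\sigma_{i};\theta^{\prime})}{\sum_{\sigma_{i}\in S^{\sigma}}f(y_{i},\sigma_{i};\theta^{\prime})},\tag{6}$$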

where

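$$\mathcal{Q}(\theta,\theta^{\prime})\coloneqq\sum_{i=1}^{N}\sum_{\sigma_{i}\in S^{\sigma}}\frac{f(y_{i},\sigma_{i};\theta^{\prime})}{\sum_{\sigma_{i}\in S^{\sigma}}f(y_{i},\sigma_{i};\theta^{\prime})}\ln f(y_{i},\sigma_{i};\theta).\tag{7}$$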

In the M step, we maximize $\mathcal{L}(\theta)$ instead of $\mathcal{K}(\theta)$ with respect to $\theta$; that is, we maximize $\mathcal{Q}(\theta,\theta^{\prime})$ under the fixed $\theta^{\prime}$. In EM, we iterate these two steps. Thus, letting $\theta_{t}$ be the tentative estimate of the parameter at the $t$-th iteration, the new estimate $\theta_{t+1}$ is determined by

$$\theta_{t+1}=\operatorname*{arg\,max}_{\theta}\,\mathcal{Q}(\theta;\theta_{t}).\tag{8}$$

This procedure is summarized in Algo. 1 with pseudo-code.

Despite the success of EM, it is known that EM is sometimes trapped in local optima and fails to estimate the optimal parameter [1, 2]. To alleviate this problem, we improve EM by employing methods from quantum mechanics.

3 Deterministic quantum annealing expectation-maximization algorithm

This section is the main part of this paper. We formulate DQAEM by quantizing the hidden variables $\{\sigma_{i}\}_{i=1}^{N}$ in EM and employing the annealing technique. In the previous section, we denoted by $\sigma_{i}$ and $S^{\sigma}$ the hidden variable for each data point $i$ and the set of its possible values, respectively. Corresponding to the classical setup, we introduce a quantum one as follows. We define an operator $\hat{\sigma}_{i}$ and a ket vector $\Ket{\sigma_{i}=k}$ $(k\in S^{\sigma})$ so that they satisfy

$$\hat{\sigma}_{i}\Ket{\sigma_{i}=k}=k\Ket{\sigma_{i}=k}.\tag{9}$$

Here we note that the eigenvalues of $\hat{\sigma}_{i}$ correspond to the possible values of the hidden variable $\sigma_{i}$ in the classical setup. In addition, we introduce a bra vector corresponding to $\Ket{\sigma_{i}=k}$ as $\Bra{\sigma_{i}=l}$ $(l\in S^{\sigma})$ so that it satisfies

$$\Braket{\sigma_{i}=k}{\sigma_{i}=l}=\delta_{k,l}.\tag{10}$$

Moreover, we denote the trace on $\hat{\sigma}_{i}$ by $\mathrm{Tr}_{\sigma_{i}}[\cdot]$ ; using the ket vector $\Ket{\sigma_{i}=k}$ and the bra vector $\Bra{\sigma_{i}=k}$ , it is represented by $\mathrm{Tr}_{\sigma_{i}}[\cdot]=\sum_{k=1}^{K}\Braket{\sigma_{i}=k}{\cdot}{\sigma_{i}=k}$ . To simplify the notation, we sometimes use $\Ket{\sigma_{i}}$ and $\Bra{\sigma_{i}}$ for $\Ket{\sigma_{i}=k}$ and $\Bra{\sigma_{i}=k}$ , respectively, and the trace on $\sigma_{i}$ is also represented by $\mathrm{Tr}_{\sigma_{i}}[\cdot]=\sum_{\sigma_{i}\in S^{\sigma}}\Braket{\sigma_{i}}{\cdot}{\sigma_{i}}$ .

Next, we define the negative free energy function as

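$$\mathcal{G}_{\beta,\Gamma}(\theta)\coloneqq\frac{1}{\beta}\ln\mathcal{Z}_{\beta,\Gamma}(\theta),\tag{11}$$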

where $\mathcal{Z}_{\beta,\Gamma}(\theta)$ denotes the partition function that is given by

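$$\mathcal{Z}_{\beta,\Gamma}(\theta)\coloneqq\prod_{i=1}^{N}\mathcal{Z}_{\beta,\Gamma}^{(i)}(\theta),\tag{12}$$

$$\mathcal{Z}_{\beta,\Gamma}^{(i)}(\theta)\coloneqq\mathrm{Tr}_{\sigma_{i}}\left[f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta)\right].\tag{13}$$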

Here, $f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta)$ is the exponential weight with a non-commutative term $H^{\mathrm{nc}}$ :

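$$f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta)\coloneqq\exp\left(-\beta\left\{H(y_{i},\hat{\sigma}_{i};\theta)+H^{\mathrm{nc}}\right\}\right),\tag{14}$$

$$H(y_{i},\hat{\sigma}_{i};\theta)\coloneqq-\ln f(y_{i},\hat{\sigma}_{i};\theta),\tag{15}$$

$$H^{\mathrm{nc}}\coloneqq\Gamma\hat{\sigma}^{\mathrm{nc}},\tag{16}$$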

where the non-commutative relation $[\hat{\sigma}_{i},\hat{\sigma}^{\mathrm{nc}}]\neq 0$ is imposed, and $\beta$ and $\Gamma$ represent parameters for simulated and quantum annealing, respectively. If we set $\Gamma=0$ and $\beta=1$, the negative free energy function, Eq. (11), reduces to the log-likelihood function, Eq. (1):

$$\mathcal{G}_{\beta=1,\Gamma=0}(\theta)=\mathcal{K}(\theta).\tag{17}$$

Therefore, in annealing processes, $\Gamma$ and $\beta$ are changed from appropriate initial values to $0$ and $1$, respectively. Corresponding to EM in the previous section, we construct the E and M steps in DQAEM. Using an arbitrary density matrix $\hat{\rho}_{i}$ that satisfies $\mathrm{Tr}_{\sigma_{i}}[\hat{\rho}_{i}]=1$, we divide the negative free energy function, Eq. (11), into two parts as

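$$\beta\mathcal{G}_{\beta,\Gamma}(\theta)=\mathcal{F}_{\beta,\Gamma}(\theta)+\mathrm{KL}\bigg(\hat{\rho}_{i}\bigg\|\frac{f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta)}{\mathrm{Tr}_{\sigma_{i}}\left[f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta)\right]}\bigg),\tag{18}$$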

where

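$$\mathcal{F}_{\beta,\Gamma}(\theta)\coloneqq\sum_{i=1}^{N}\mathrm{Tr}_{\sigma_{i}}\bigg[\hat{\rho}_{i}\bigg\{\ln f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta)-\ln\hat{\rho}_{i}\bigg\}\bigg],\tag{19}$$

$$\mathrm{KL}\bigg(\hat{\rho}_{i}\bigg\|\frac{f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta)}{\mathrm{Tr}_{\sigma_{i}}\left[f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta)\right]}\bigg)\coloneqq-\sum_{i=1}^{N}\mathrm{Tr}_{\sigma_{i}}\bigg[\hat{\rho}_{i}\bigg\{\ln\frac{f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta)}{\mathrm{Tr}_{\sigma_{i}}\left[f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta)\right]}-\ln\hat{\rho}_{i}\bigg\}\bigg].\tag{20}$$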

Here, $\mathrm{KL}(\hat{\rho}_{1}\|\hat{\rho}_{2})=\mathrm{Tr}[\hat{\rho}_{1}(\ln\hat{\rho}_{1}-\ln\hat{\rho}_{2})]$ is a quantum extension of the Kullback-Leibler divergence between density matrices $\hat{\rho}_{1}$ and $\hat{\rho}_{2}$ [32].

In the E step of DQAEM, Eq. (20) under a fixed $\theta$ is minimized; then we obtain

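$$\hat{\rho}_{i}=\frac{f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta^{\prime})}{\mathrm{Tr}_{\sigma_{i}}\left[f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta^{\prime})\right]},\tag{21}$$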

which is a quantum extension of Eq. (5). Note that, to calculate $f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta)$ , we need to use the Suzuki-Trotter expansion [33], the Taylor expansion, the Padé approximation or the technique proposed in Ref. [34], because $f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta^{\prime})$ has the non-commutative term $H^{\mathrm{nc}}$ .
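For a small number of components, the density matrix of the E step can also be obtained by exact diagonalization, which is used in the sketch below in place of the Suzuki-Trotter, Taylor, or Padé options listed above. The sketch assumes that $H$ is diagonal in the $\Ket{\sigma_{i}=k}$ basis with entries `h[k]`, and that $\hat{\sigma}^{\mathrm{nc}}$ is a cyclic nearest-neighbour hopping matrix; both the helper names and the numerical values are hypothetical.

```python
import numpy as np

def hopping_operator(K):
    """Cyclic nearest-neighbour hopping matrix: an assumed form of sigma^nc."""
    nc = np.zeros((K, K))
    for k in range(K):
        nc[(k + 1) % K, k] += 1.0   # |k+1><k| with periodic boundary
        nc[(k - 1) % K, k] += 1.0   # |k-1><k| with periodic boundary
    return nc

def density_matrix(h, beta, gamma):
    """rho = exp(-beta (H + Gamma sigma^nc)) / Tr[...], with H = diag(h)."""
    K = len(h)
    total = np.diag(h) + gamma * hopping_operator(K)   # real symmetric matrix
    evals, evecs = np.linalg.eigh(total)
    # Shift by the smallest eigenvalue for stability; it cancels on normalisation.
    weights = np.exp(-beta * (evals - evals.min()))
    f = (evecs * weights) @ evecs.T                    # V diag(w) V^T
    return f / np.trace(f)

h = np.array([1.0, 2.5, 0.3])     # hypothetical energies -ln f(y, sigma=k; theta)
rho_q = density_matrix(h, beta=1.0, gamma=0.8)   # quantum case
rho_c = density_matrix(h, beta=1.0, gamma=0.0)   # classical limit
classical = np.exp(-h) / np.exp(-h).sum()        # ordinary posterior
```

At `gamma=0.0` the density matrix is diagonal and its diagonal is the classical posterior; for `gamma > 0` off-diagonal elements appear, which is where the quantum fluctuations enter.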

On the other hand, in the M step of DQAEM, the parameter $\theta$ is determined by maximizing Eq. (19). Following the method in the previous section, we substitute Eq. (21) into Eq. (19); then we have

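$$\mathcal{F}_{\beta,\Gamma}(\theta)=\mathcal{U}_{\beta,\Gamma}(\theta;\theta^{\prime})-\sum_{i=1}^{N}\mathrm{Tr}_{\sigma_{i}}\bigg[\frac{f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta^{\prime})}{\mathrm{Tr}_{\sigma_{i}}\left[f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta^{\prime})\right]}\ln\frac{f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta^{\prime})}{\mathrm{Tr}_{\sigma_{i}}\left[f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta^{\prime})\right]}\bigg],\tag{22}$$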

where

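$$\mathcal{U}_{\beta,\Gamma}(\theta;\theta^{\prime})\coloneqq\sum_{i=1}^{N}\mathrm{Tr}_{\sigma_{i}}\bigg[\frac{f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta^{\prime})}{\mathrm{Tr}_{\sigma_{i}}\left[f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta^{\prime})\right]}\ln f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta)\bigg].\tag{23}$$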

Note that $\mathcal{U}_{\beta,\Gamma}(\theta;\theta^{\prime})$ represents the quantum extension of $\mathcal{Q}(\theta,\theta^{\prime})$ , Eq. (7). Thus, the computation of the M step of DQAEM is written as

$$\theta_{t+1}=\operatorname*{arg\,max}_{\theta}\,\mathcal{U}_{\beta,\Gamma}(\theta,\theta_{t}).\tag{24}$$

During iterations, the annealing parameter $\Gamma$, which controls the strength of quantum fluctuations, is changed from the initial value $\Gamma_{0}\ (\Gamma_{0}\geq 0)$ to $0$, and the other parameter $\beta$, which controls the strength of thermal fluctuations, is changed from the initial value $\beta_{0}\ (0<\beta_{0}\leq 1)$ to $1$. We summarize DQAEM in Algo. 2.

Finally, we mention that the negative free energy function, Eq. (11), does not decrease when the parameter $\theta_{t}$ is updated by DQAEM. The proof is given in Appendix A.

4 Mechanism of DQAEM

In this section, we explain an advantage of DQAEM over EM and DSAEM using the path integral formulation. First, we demonstrate that DQAEM is a quantum extension of EM and DSAEM. Setting $\Gamma=0$ in Eq. (23), which corresponds to the classical limit, we have

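$$\mathcal{U}_{\beta,\Gamma=0}(\theta;\theta^{\prime})=\sum_{i=1}^{N}\frac{1}{\sum_{\sigma_{i}\in S^{\sigma}}\Braket{\sigma_{i}}{f_{\beta,\Gamma=0}(y_{i},\hat{\sigma}_{i};\theta^{\prime})}{\sigma_{i}}}\times\sum_{\sigma_{i}\in S^{\sigma}}\Braket{\sigma_{i}}{f_{\beta,\Gamma=0}(y_{i},\hat{\sigma}_{i};\theta^{\prime})\ln f_{\beta,\Gamma=0}(y_{i},\hat{\sigma}_{i};\theta)}{\sigma_{i}},\tag{25}$$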

where we use $\mathrm{Tr}_{\sigma_{i}}[\cdot]=\sum_{\sigma_{i}\in S^{\sigma}}\Braket{\sigma_{i}}{\cdot}{\sigma_{i}}$ . Taking into account that $f_{\beta,\Gamma=0}(y_{i},\hat{\sigma}_{i};\theta^{\prime})$ does not have the non-commutative term $H^{\mathrm{nc}}$ , we get

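$$\mathcal{U}_{\beta,\Gamma=0}(\theta;\theta^{\prime})=\sum_{i=1}^{N}\sum_{\sigma_{i}\in S^{\sigma}}\frac{f_{\beta,\Gamma=0}(y_{i},\sigma_{i};\theta^{\prime})}{\sum_{\sigma_{i}\in S^{\sigma}}f_{\beta,\Gamma=0}(y_{i},\sigma_{i};\theta^{\prime})}\ln f_{\beta,\Gamma=0}(y_{i},\sigma_{i};\theta).\tag{26}$$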

Equation (24) with Eq. (26) provides the update equation for $\theta_{t+1}$ in DSAEM, and we obtain $\theta_{t+1}$ by solving

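$$\sum_{i=1}^{N}\sum_{\sigma_{i}\in S^{\sigma}}\bigg\{\frac{f_{\beta,\Gamma=0}(y_{i},\sigma_{i};\theta_{t})}{\sum_{\sigma_{i}\in S^{\sigma}}f_{\beta,\Gamma=0}(y_{i},\sigma_{i};\theta_{t})}\frac{d}{d\theta}H(y_{i},\sigma_{i};\theta)\bigg\}=0.\tag{27}$$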

This update rule is equivalent to that of DSAEM [35]. Furthermore, if we set $\beta=1$, Eq. (26) equals Eq. (7), and DSAEM in turn reduces to EM.
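To make the classical limit concrete: at $\Gamma=0$ the weight $f_{\beta,\Gamma=0}=e^{-\beta H}$ is simply the joint density raised to the power $\beta$, so the E-step weights are a tempered posterior. The following Python sketch uses hypothetical joint-density values and a hypothetical helper name to illustrate how $\beta$ flattens the weights and how $\beta=1$ recovers the EM posterior.

```python
def tempered_posterior(joint, beta):
    """E-step weights at Gamma = 0: proportional to f(y, sigma; theta) ** beta."""
    weights = [fk ** beta for fk in joint]
    total = sum(weights)
    return [w / total for w in weights]

joint = [0.10, 0.30, 0.05]                 # hypothetical values of f(y, sigma; theta)
em_weights = tempered_posterior(joint, 1.0)      # beta = 1: ordinary EM posterior
warm_weights = tempered_posterior(joint, 0.5)    # beta < 1: flattened weights
uniform_weights = tempered_posterior(joint, 0.0) # beta = 0: uniform weights
```

Lower values of `beta` move the weights toward the uniform distribution, which is the thermal smoothing that DSAEM relies on.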

On the other hand, in DQAEM, the parameter $\theta$ is updated based on Eq. (24) with Eq. (23). Thus we obtain $\theta_{t+1}$ by solving

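$$\sum_{i=1}^{N}\mathrm{Tr}_{\sigma_{i}}\bigg[\frac{f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta_{t})}{\mathrm{Tr}_{\sigma_{i}}\left[f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta_{t})\right]}\frac{d}{d\theta}H(y_{i},\hat{\sigma}_{i};\theta)\bigg]=0.\tag{28}$$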

Using the bra-ket notation, Eq. (28) can be arranged as

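$$\sum_{i=1}^{N}\sum_{\sigma_{i}\in S^{\sigma}}\bigg\{\frac{\Braket{\sigma_{i}}{f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta_{t})}{\sigma_{i}}}{\mathrm{Tr}_{\sigma_{i}}\left[f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta_{t})\right]}\frac{d}{d\theta}H(y_{i},\sigma_{i};\theta)\bigg\}=0.\tag{29}$$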

The difference between Eqs. (27) and (29) is the existence of the non-commutative term $H^{\mathrm{nc}}$ in $f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta^{\prime})$ , which gives rise to the advantage of DQAEM. To evaluate the effect of $H^{\mathrm{nc}}$ , we calculate $\Braket{\sigma_{i}}{f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta^{\prime})}{\sigma_{i}}$ by applying the Suzuki-Trotter expansion [33]. Then we obtain

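$$\Braket{\sigma_{i}}{f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta^{\prime})}{\sigma_{i}}=\lim_{M\to\infty}\sum_{\sigma_{i,1}^{\prime},\sigma_{i,1},\dots,\sigma_{i,M-1},\sigma_{i,M}^{\prime}\in S^{\sigma}}\prod_{j=1}^{M}\Braket{\sigma_{i,j}}{e^{-\frac{\beta}{M}H(y_{i},\hat{\sigma}_{i};\theta^{\prime})}}{\sigma_{i,j}^{\prime}}\Braket{\sigma_{i,j}^{\prime}}{e^{-\frac{\beta}{M}H^{\mathrm{nc}}}}{\sigma_{i,j-1}},\tag{31}$$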

with the boundary conditions $\Ket{\sigma_{i,0}}=\Ket{\sigma_{i,M}}=\Ket{\sigma_{i}}$ and $\Bra{\sigma_{i,0}}=\Bra{\sigma_{i,M}}=\Bra{\sigma_{i}}$ . In Eq. (31), the quantum effect comes from $\Braket{\sigma_{i,j}}{e^{-\frac{\beta}{M}H^{\mathrm{nc}}}}{\sigma_{i,j-1}}$ . If we assume the classical case $H^{\mathrm{nc}}=0$ (i.e. $\Gamma=0$ ), we obtain $\Braket{\sigma_{i,j}}{e^{-\frac{\beta}{M}H^{\mathrm{nc}}}}{\sigma_{i,j-1}}=\Braket{\sigma_{i,j}}{\sigma_{i,j-1}}=\delta_{\sigma_{i,j},\sigma_{i,j-1}}$ , where $\delta_{\cdot,\cdot}$ is the Kronecker delta function. Thus, $\Ket{\sigma_{i,j}}$ does not depend on the index along the Trotter dimension, $j$ . In terms of the path integral formulation, this fact implies that $\Braket{\sigma_{i}}{f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta^{\prime})}{\sigma_{i}}$ can be evaluated by a single classical path with the boundary condition fixed at $\sigma_{i}$ ; see Fig. 1. On the other hand, in the quantum case, $\Braket{\sigma_{i,j}}{e^{-\frac{\beta}{M}H^{\mathrm{nc}}}}{\sigma_{i,j-1}}\neq\Braket{\sigma_{i,j}}{\sigma_{i,j-1}}$ , and therefore Eq. (31) involves not only the classical path but also quantum paths, which depend on the form of $H^{\mathrm{nc}}$ ; see Fig. 1. Thus, owing to these quantum paths, DQAEM may overcome the problem of local optima and it is expected that DQAEM outperforms EM and DSAEM. In Sec. 5, we show that the quantum effect really helps EM to find the global optimum through numerical simulations.
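The role of the Trotter number $M$ can be checked numerically. The Python sketch below compares the first-order Suzuki-Trotter product $\big(e^{-\frac{\beta}{M}H}e^{-\frac{\beta}{M}H^{\mathrm{nc}}}\big)^{M}$ with the exact matrix exponential for a three-level system. The energies, the value of $\Gamma$, and the cyclic hopping form chosen for $\hat{\sigma}^{\mathrm{nc}}$ are assumptions made only for this illustration.

```python
import numpy as np

def expm_sym(a):
    """Matrix exponential of a real symmetric matrix via eigendecomposition."""
    evals, evecs = np.linalg.eigh(a)
    return (evecs * np.exp(evals)) @ evecs.T

def trotter(h_diag, h_nc, beta, M):
    """First-order Suzuki-Trotter product with M slices."""
    step = expm_sym(-beta / M * np.diag(h_diag)) @ expm_sym(-beta / M * h_nc)
    return np.linalg.matrix_power(step, M)

h = np.array([1.0, 2.5, 0.3])          # hypothetical diagonal energies
nc = np.ones((3, 3)) - np.eye(3)       # cyclic hopping for K = 3 (assumed sigma^nc)
gamma, beta = 0.8, 1.0

exact = expm_sym(-beta * (np.diag(h) + gamma * nc))
err = {M: np.abs(trotter(h, gamma * nc, beta, M) - exact).max()
       for M in (1, 10, 100, 1000)}

# Classical case: with Gamma = 0 the two factors commute, so one slice is exact.
classical = trotter(h, 0.0 * nc, beta, 1)
```

The error in `err` shrinks as `M` grows, while for $\Gamma=0$ the product is already exact and diagonal at `M = 1`, matching the single classical path described above.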

Before closing this section, we represent Eq. (31) with the path integral form. In the limit $M\rightarrow\infty$ , we define

[TABLE]

where $\tau=\beta j/M$ . This notation leads to another expression of Eq. (31) as

[TABLE]

where $\int\mathcal{D}\sigma_{i}(\tau)$ represents the integration over all paths and $S[\sigma_{i}(\tau)]$ is the action given by

[TABLE]

Note that this action also appears in the discussion of the Berry phase [36, 37].

5 Numerical simulations

In this section, we present numerical simulations to confirm the performance of DQAEM. Here, we deal with GMMs with hidden variables. We introduce the quantum representation of GMMs in Sec. 5.1, and give update equations in Sec. 5.2. In Sec. 5.3, we show numerical results to elucidate the advantage of DQAEM over EM and DSAEM.

5.1 Gaussian mixture models

First, we provide a concrete setup of GMMs. Consider a GMM composed of $K$ Gaussian functions, with the domain of the hidden variables $S^{\sigma}$ given by $\{1,2,\dots,K\}$ [38]. (When we work in the one-hot notation [1, 2], we can formulate an equivalent quantization scheme.) Then the distribution of the GMM is written as

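$$f(y_{i},\sigma_{i};\theta)=\sum_{k=1}^{K}\pi^{k}g^{k}(y_{i};\mu^{k},\Sigma^{k})\delta_{\sigma_{i},k},\tag{35}$$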

where $g^{k}(y_{i};\mu^{k},\Sigma^{k})$ is the $k$-th Gaussian function of the GMM parametrized by $\{\mu^{k},\Sigma^{k}\}$, and $\delta_{\cdot,\cdot}$ is the Kronecker delta. The parameters $\{\pi^{k}\}_{k=1}^{K}$ stand for the mixing ratios that satisfy $\sum_{k=1}^{K}\pi^{k}=1$. Here, we denote all the parameters collectively by $\theta$: $\theta=\{\pi^{k},\mu^{k},\Sigma^{k}\}_{k=1}^{K}$. From Eq. (15), we get the Hamiltonian of this system:

$$H(y_{i},\sigma_{i};\theta)=\sum_{k=1}^{K}h_{i}^{k}\delta_{\sigma_{i},k},\tag{36}$$

where we take the logarithm of Eq. (35) and $h_{i}^{k}=-\ln(\pi^{k}g(y_{i};\mu^{k},\Sigma^{k}))$ .

Next, following the discussion of Sec. 3, we define the ket vector $\Ket{\sigma_{i}}$ , and the operator $\hat{\sigma}_{i}$ as

[TABLE]

where $k$ represents the label of components of the GMM; that is, $k\in S^{\sigma}$ . To quantize Eq. (36), we replace $\sigma_{i}$ by $\hat{\sigma}_{i}$ . Then, we obtain the quantum Hamiltonian of the GMM as

[TABLE]

where we use $\delta_{\hat{\sigma}_{i},k}=\Ket{\sigma_{i}=k}\Bra{\sigma_{i}=k}$ . If we adopt the representation,

[TABLE]

Eq. (39) can be expressed as

[TABLE]

Finally we add a non-commutative term $H^{\mathrm{nc}}=\Gamma\hat{\sigma}^{\mathrm{nc}}$ with $[\hat{\sigma}_{i},\hat{\sigma}^{\mathrm{nc}}]\neq 0$ to Eq. (38). Obviously, there are many candidates for $\hat{\sigma}^{\mathrm{nc}}$ ; however, in the numerical simulation, we adopt

[TABLE]

where $\Ket{\sigma_{i}=0}=\Ket{\sigma_{i}=K}$ and $\Ket{\sigma_{i}=K+1}=\Ket{\sigma_{i}=1}$ . This term induces quantum paths shown in Fig. 1.

5.2 Update equations

Suppose that we have $N$ data points $Y_{\mathrm{obs}}=\{y_{1},y_{2},\dots,y_{N}\}$ , and they are independent and identically distributed obeying a GMM. In EM, the update equations of GMMs are obtained by Eq. (8) with the constraint $\sum_{k=1}^{K}\pi^{k}=1$ ; then the parameters at the $(t+1)$ -th iteration, $\theta_{t+1}$ , are given by

[TABLE]

where $N_{k}=\sum_{i=1}^{N}\frac{f(y_{i},\sigma_{i}=k;\theta_{t})}{\sum_{\sigma_{i}\in S^{\sigma}}f(y_{i},\sigma_{i};\theta_{t})}$ and $\theta_{t}$ is the tentative estimate of the parameters at the $t$-th iteration.
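As an illustration of one EM iteration for a GMM, the following Python sketch uses the textbook update formulas for a one-dimensional mixture (responsibility-weighted counts, means, and variances). The synthetic data, the initial values, and the function names are hypothetical, and the one-dimensional restriction is a simplification made here.

```python
import numpy as np

def log_joint(y, pi, mu, var):
    """ln f(y_i, sigma_i = k; theta) for a 1-D GMM, shape (N, K)."""
    return np.log(pi) - 0.5 * np.log(2.0 * np.pi * var) - 0.5 * (y[:, None] - mu) ** 2 / var

def log_likelihood(y, pi, mu, var):
    lj = log_joint(y, pi, mu, var)
    m = lj.max(axis=1, keepdims=True)            # log-sum-exp for stability
    return float((m[:, 0] + np.log(np.exp(lj - m).sum(axis=1))).sum())

def em_step(y, pi, mu, var):
    # E step: posterior responsibilities f / sum_sigma f.
    lj = log_joint(y, pi, mu, var)
    r = np.exp(lj - lj.max(axis=1, keepdims=True))
    r /= r.sum(axis=1, keepdims=True)
    # M step: weighted maximum likelihood estimates.
    n_k = r.sum(axis=0)
    new_pi = n_k / len(y)
    new_mu = (r * y[:, None]).sum(axis=0) / n_k
    new_var = (r * (y[:, None] - new_mu) ** 2).sum(axis=0) / n_k
    return new_pi, new_mu, new_var

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(4.0, 1.0, 200)])
pi, mu, var = np.array([0.5, 0.5]), np.array([0.0, 1.0]), np.array([1.0, 1.0])

history = [log_likelihood(y, pi, mu, var)]
for _ in range(20):
    pi, mu, var = em_step(y, pi, mu, var)
    history.append(log_likelihood(y, pi, mu, var))
```

The values in `history` never decrease, which is the monotonicity property of EM; whether the limit is the global optimum still depends on the initial values.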

In DQAEM, from Eq. (24), the update equations are expressed by

[TABLE]

where $N_{k}^{\mathrm{QA}}=\sum_{i=1}^{N}\frac{f_{\beta,\Gamma}^{\mathrm{QA}}(y_{i},\sigma_{i}=k;\theta_{t})}{\mathrm{Tr}_{\sigma_{i}}\left[f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta_{t})\right]}$ and we denote $\Braket{\sigma_{i}=k}{f_{\beta,\Gamma}(y_{i},\hat{\sigma}_{i};\theta_{t})}{\sigma_{i}=k}$ by $f_{\beta,\Gamma}^{\mathrm{QA}}(y_{i},\sigma_{i}=k;\theta_{t})$ for simplicity. If we set $\Gamma=0$ and $\beta=1$, the update equations of DQAEM, Eqs. (46), (47), and (48), reduce to those of EM, Eqs. (43), (44), and (45). In annealing schedules, the parameters $\beta$ and $\Gamma$ are changed over the iterations from their initial values to $1$ and $0$, respectively.
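The structure of these updates can be sketched in Python for a one-dimensional GMM: the only change from EM is that the responsibilities are the normalized diagonal elements $f_{\beta,\Gamma}^{\mathrm{QA}}/\mathrm{Tr}[f_{\beta,\Gamma}]$. This is a minimal sketch under several assumptions: $\hat{\sigma}^{\mathrm{nc}}$ is taken to be a cyclic hopping matrix, the matrix exponential is evaluated by exact diagonalization, the M step uses the standard weighted GMM formulas, and the data and function names are hypothetical.

```python
import numpy as np

def hopping_operator(K):
    """Cyclic nearest-neighbour hopping matrix: an assumed form of sigma^nc."""
    nc = np.zeros((K, K))
    for k in range(K):
        nc[(k + 1) % K, k] += 1.0
        nc[(k - 1) % K, k] += 1.0
    return nc

def energies(y, pi, mu, var):
    """h[i, k] = -ln(pi^k g(y_i; mu^k, var^k)) for a 1-D GMM."""
    log_g = -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (y[:, None] - mu) ** 2 / var
    return -(np.log(pi) + log_g)

def quantum_responsibilities(h, beta, gamma, nc):
    """r[i, k] = <k| f |k> / Tr[f] with f = exp(-beta (diag(h_i) + gamma nc))."""
    r = np.empty_like(h)
    for i, h_i in enumerate(h):
        evals, evecs = np.linalg.eigh(np.diag(h_i) + gamma * nc)
        w = np.exp(-beta * (evals - evals.min()))
        diag_f = (evecs ** 2) @ w          # diagonal of V diag(w) V^T
        r[i] = diag_f / diag_f.sum()
    return r

def m_step(y, r):
    n_k = r.sum(axis=0)
    pi = n_k / len(y)
    mu = (r * y[:, None]).sum(axis=0) / n_k
    var = (r * (y[:, None] - mu) ** 2).sum(axis=0) / n_k
    return pi, mu, var

def dqaem(y, pi, mu, var, gamma0=1.2, tau=0.95, beta=1.0, n_iter=50):
    nc = hopping_operator(len(pi))
    for t in range(n_iter):
        gamma = gamma0 * np.exp(-t / tau)   # quantum fluctuations are annealed to 0
        r = quantum_responsibilities(energies(y, pi, mu, var), beta, gamma, nc)
        pi, mu, var = m_step(y, r)
    return pi, mu, var

rng = np.random.default_rng(0)
y = np.concatenate([rng.normal(-2.0, 1.0, 200), rng.normal(4.0, 1.0, 200)])
pi0, mu0, var0 = np.array([0.5, 0.5]), np.array([0.0, 1.0]), np.array([1.0, 1.0])
pi, mu, var = dqaem(y, pi0, mu0, var0)
```

Calling `quantum_responsibilities` with `gamma=0.0` and `beta=1.0` returns the ordinary EM posterior, which mirrors the reduction stated above.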

5.3 Numerical results

To clarify the advantage of DQAEM, we compare DQAEM with EM and DSAEM by using numerical simulations. In the annealing schedule of DSAEM, we exponentially change $\beta$ from $0.7$ to $1$, following $\beta_{t}=(\beta_{0}-1)\exp(-t/0.95)+1$, where $\beta_{t}$ denotes $\beta$ at the $t$-th iteration. On the other hand, in DQAEM, we exponentially change $\Gamma$ from $1.2$ to $0$, following $\Gamma_{t}=\Gamma_{0}\exp(-t/0.95)$, where $\Gamma_{t}$ is $\Gamma$ at the $t$-th iteration. To compare the “pure” quantum effect in DQAEM with the thermal effect in DSAEM, we fix $\beta_{t}=1.0$ in the annealing schedule of DQAEM. In the numerical simulation, we apply the three algorithms to the two-dimensional data set shown in Fig. 2(a).
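The two schedules are simple enough to write out directly. In the Python sketch below the function names are hypothetical; the initial values and the time constant $0.95$ are the ones quoted in the text.

```python
import math

def beta_schedule(t, beta0=0.7, tau=0.95):
    """DSAEM schedule: beta_t = (beta0 - 1) exp(-t / tau) + 1, rising to 1."""
    return (beta0 - 1.0) * math.exp(-t / tau) + 1.0

def gamma_schedule(t, gamma0=1.2, tau=0.95):
    """DQAEM schedule: Gamma_t = Gamma0 exp(-t / tau), decaying to 0."""
    return gamma0 * math.exp(-t / tau)

betas = [beta_schedule(t) for t in range(11)]
gammas = [gamma_schedule(t) for t in range(11)]
```

With this time constant both parameters are within $10^{-4}$ of their final values after ten iterations, so the later iterations behave essentially like plain EM.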

As a result, the log-likelihood functions of EM, the negative free energy functions of DSAEM, and those of DQAEM are plotted by red, orange, and blue lines in Fig. 2(b), respectively. In this numerical simulation, the optimal value of the log-likelihood function is $-3616.6$, which is depicted by the green line. From this figure, we see that some trials of the three algorithms succeed in finding the global optimum, but others fail.

To clarify the performance of the three algorithms, we run each of DQAEM, EM, and DSAEM 1000 times. We summarize the success ratios of DQAEM, EM, and DSAEM in Table 1. From Table 1, we conclude that DQAEM is superior to both EM and DSAEM.

For an intuitive understanding of why DQAEM outperforms EM and DSAEM, we present some additional numerical simulations in Appendix B.

6 Conclusion

In this paper, we have proposed a new algorithm, which we call DQAEM, to alleviate the problem of local optima in EM and DSAEM by introducing quantum fluctuations. After formulating it, we have discussed its mechanism from the viewpoint of the path integral formulation. Furthermore, as an application, we have adopted GMMs and shown through numerical simulations that DQAEM outperforms EM and DSAEM. Before closing this paper, we note that the form of the non-commutative term $H^{\mathrm{nc}}$ is arbitrary. The optimal choice of this term is an open problem, as it is in the original quantum annealing.

Acknowledgements

HM thanks Masayuki Ohzeki and Shu Tanaka for fruitful discussions. This research is partially supported by the Platform for Dynamic Approaches to Living Systems funded by MEXT and AMED, and Grant-in-Aid for Scientific Research (B) (16H04382), Japan Society for the Promotion of Science.

Appendix A Convergence theorem

We give a theorem stating that the negative free energy function is monotonically non-decreasing over iterations.

Theorem 1.

Let $\theta_{t+1}=\operatorname*{arg\,max}_{\theta}\,\mathcal{U}_{\beta,\Gamma}(\theta;\theta_{t})$ . Then $\mathcal{G}_{\beta,\Gamma}(\theta_{t+1})\geq\mathcal{G}_{\beta,\Gamma}(\theta_{t})$ holds.

Proof.

First, the difference of the free energy functions in each iteration can be written as

[TABLE]

where

[TABLE]

The first two terms on the right-hand side of Eq. (50) are together non-negative due to Eq. (24). Furthermore, we can show that the rest of the right-hand side of Eq. (50) is non-negative by the following calculation:

[TABLE]

∎

This theorem implies that DQAEM converges to the global optimum or, at least, to a local optimum. The global convergence of EM is discussed by Dempster et al. [3] and Wu [39], and their discussion also applies to DQAEM.
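Theorem 1 can be checked numerically at fixed $\beta$ and $\Gamma$. The Python sketch below runs the update for a two-component, one-dimensional GMM and records $\frac{1}{\beta}\sum_{i}\ln\mathrm{Tr}_{\sigma_{i}}[f_{\beta,\Gamma}]$ after each iteration. The form of $\hat{\sigma}^{\mathrm{nc}}$, the values of $\beta$ and $\Gamma$, the synthetic data, and the $1/\beta$ normalization of the negative free energy are assumptions of this illustration.

```python
import numpy as np

def free_energy_and_resp(y, pi, mu, var, beta, gamma, nc):
    """Return (sum_i ln Tr f_i) / beta and the diagonal responsibilities."""
    log_g = -0.5 * np.log(2.0 * np.pi * var) - 0.5 * (y[:, None] - mu) ** 2 / var
    h = -(np.log(pi) + log_g)                      # h[i, k] = -ln(pi^k g_k(y_i))
    total, resp = 0.0, np.empty_like(h)
    for i, h_i in enumerate(h):
        evals, evecs = np.linalg.eigh(np.diag(h_i) + gamma * nc)
        shift = evals.min()
        w = np.exp(-beta * (evals - shift))
        total += (-beta * shift + np.log(w.sum())) / beta   # ln Tr f_i, divided by beta
        resp[i] = (evecs ** 2) @ w / w.sum()
    return total, resp

def m_step(y, r):
    n_k = r.sum(axis=0)
    mu = (r * y[:, None]).sum(axis=0) / n_k
    var = (r * (y[:, None] - mu) ** 2).sum(axis=0) / n_k
    return n_k / len(y), mu, var

rng = np.random.default_rng(1)
y = np.concatenate([rng.normal(-2.0, 1.0, 150), rng.normal(4.0, 1.0, 150)])
nc = np.array([[0.0, 1.0], [1.0, 0.0]])            # assumed sigma^nc for K = 2
beta, gamma = 0.8, 0.5                             # held fixed during the run
pi, mu, var = np.array([0.5, 0.5]), np.array([0.0, 1.0]), np.array([1.0, 1.0])

values = []
for _ in range(15):
    g, r = free_energy_and_resp(y, pi, mu, var, beta, gamma, nc)
    values.append(g)
    pi, mu, var = m_step(y, r)
values.append(free_energy_and_resp(y, pi, mu, var, beta, gamma, nc)[0])
```

The recorded `values` form a non-decreasing sequence, as the theorem states; note that the guarantee is for fixed $\beta$ and $\Gamma$, not across changes of the annealing parameters.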

Appendix B Numerical results

We discuss the performances of DQAEM, EM, and DSAEM by showing how the landscape of the negative free energy function behaves when $\beta$ and $\Gamma$ are changed. For simplicity, we assume that a GMM is composed of two one-dimensional Gaussian functions and consider the case where only means of two Gaussian functions are estimated. Accordingly, the joint probability density function is given by

[TABLE]

where $g(y_{i};\mu,\Sigma)$ is a Gaussian function with mean $\mu$ and covariance $\Sigma$ . Under this setup, we estimate $\theta=\{\mu^{1},\mu^{2}\}$ : we assume that $\theta=\{\mu^{1},\mu^{2}\}$ are unknown and $\{\pi^{1},\pi^{2},\Sigma^{1},\Sigma^{2}\}$ are given.

First, let us describe the landscapes of the negative free energy functions with different $\beta$ and $\Gamma$. We plot the negative free energy functions of DQAEM at $\Gamma=0.0$, $5.0$, $10.0$, and $50.0$ with $\beta=1$ in Fig. 3(a), (b), (c), and (d), respectively. Specifically, Fig. 3(a) describes the log-likelihood function, since DQAEM reduces to EM at $\beta=1$ and $\Gamma=0$. In the example used here, the global optimum for $\{\mu^{1},\mu^{2}\}$ is $\{-2.0,4.0\}$ (the top left in Fig. 3(a)), and the point $\{4.0,-2.0\}$ (the top right in Fig. 3(a)) is a local optimum. If we set the initial value at a point close to the local optimum, EM fails to find the global optimum. In contrast, Fig. 3 shows that the negative free energy function changes from a multimodal form to a unimodal form as $\Gamma$ increases. Thus, even if we set the initial value close to the local optimum, DQAEM may find the global optimum through the annealing process. On the other hand, in Fig. 4, we plot the negative free energy functions of DSAEM at inverse temperature $\beta=1.0$, $0.3$, $0.1$, and $0.01$. Lowering $\beta$ in DSAEM has a similar effect on the negative free energy function.

However, as we have seen in Sec. 5.3, they differ qualitatively. To understand this fact intuitively, we show the trajectories of the estimated parameters of EM, DSAEM, and DQAEM in Fig. 5. In this numerical simulation, the initial parameter $\theta_{0}$ is set near the local optimum of the log-likelihood function. The red line depicts the trajectory of the estimated parameter of EM, which goes to the local optimum, crossing the contour lines orthogonally. As shown by the orange line, DSAEM eventually fails to find the global optimum, just as EM does. On the other hand, the estimated parameter $\theta_{t}$ of DQAEM, shown by the blue line, surmounts the potential barrier and reaches the global optimum. Accordingly, we consider that DQAEM outperforms EM and DSAEM.

Bibliography

[1] Bishop C 2007 Pattern recognition and machine learning (information science and statistics), 1st edn. 2006. corr. 2nd printing edn
[2] Murphy K P 2012 Machine learning: a probabilistic perspective (MIT press)
[3] Dempster A P, Laird N M and Rubin D B 1977 Journal of the Royal Statistical Society, Series B 39 1–38
[4] Kirkpatrick S, Gelatt C D and Vecchi M P 1983 Science 220 671–680 URL http://www.sciencemag.org/content/220/4598/671.abstract
[5] Kirkpatrick S 1984 Journal of Statistical Physics 34 975–986 ISSN 0022-4715 URL http://dx.doi.org/10.1007/BF01009452
[6] Geman S and Geman D 1984 Pattern Analysis and Machine Intelligence, IEEE Transactions on PAMI-6 721–741 ISSN 0162-8828
[7] Rose K, Gurewitz E and Fox G 1990 Pattern Recognition Letters 11 589–594 ISSN 0167-8655 URL http://www.sciencedirect.com/science/article/pii/016786559090010Y
[8] Rose K, Gurewitz E and Fox G C 1990 Phys. Rev. Lett. 65 (8) 945–948 URL http://link.aps.org/doi/10.1103/PhysRevLett.65.945
