Stochastic Divergence Minimization for Biterm Topic Model

Zhenghang Cui; Issei Sato; Masashi Sugiyama

arXiv:1705.00394·stat.ML·April 4, 2018


TL;DR

This paper introduces a stochastic divergence minimization algorithm for the Biterm Topic Model, improving the accuracy and scalability of short text topic inference compared to existing methods.

Contribution

It proposes a novel stochastic inference algorithm for BTM that reduces computational complexity and enhances estimation accuracy over prior approaches.

Findings

01. The new algorithm outperforms existing inference methods in experiments.

02. It achieves better scalability for large short text datasets.

03. It demonstrates improved accuracy in latent topic estimation.


Tables

Table 1: Comparison Between Algorithms.

Algorithm	Update Cost	Memory	Purpose
iBTM	$\mathcal{O}(R)$	$\mathcal{O}(K(1+W)+B_{t})$	Post. Approx.
oBTM	$\mathcal{O}(B_{t}(1+KW))$	$\mathcal{O}(K(1+W)+N_{B})$	Post. Approx.
SCVB0-BTM	$\mathcal{O}(K)$	$\mathcal{O}(K(1+W))$	Post. Approx.
SDM-BTM	$\mathcal{O}(K)$	$\mathcal{O}(K(1+W))$	LOO Est.

Equations

$$P(B\mid\theta,\Phi)=\prod_{i=1}^{N_{B}}\sum_{k=1}^{K}\theta_{k}\,\phi_{k,w_{i1}}\,\phi_{k,w_{i2}}.$$

$$P(z_{i}=k\mid z_{\backslash i},B)\propto(n_{\backslash i,k}+\gamma)\frac{(n_{\backslash i,w_{i1}|k}+\beta)(n_{\backslash i,w_{i2}|k}+\beta)}{(n_{\backslash i,\cdot|k}+W\beta)(n_{\backslash i,\cdot|k}+W\beta+1)}.$$

$$\phi_{k,w}=\frac{n_{w|k}+\beta}{n_{\cdot|k}+W\beta},$$

$$\theta_{k}=\frac{n_{k}+\gamma}{N_{B}+K\gamma},$$

$$P(z_{i}=k\mid z_{\backslash i}^{(t)},B^{(t)},\gamma^{(t)},\beta^{(t)})\propto(n_{\backslash i,k}^{(t)}+\gamma_{k}^{(t)})\frac{(n_{\backslash i,w_{i1}|k}^{(t)}+\beta_{k,w_{i1}}^{(t)})(n_{\backslash i,w_{i2}|k}^{(t)}+\beta_{k,w_{i2}}^{(t)})}{\left[\sum_{w=1}^{W}(n_{\backslash i,w|k}^{(t)}+\beta_{k,w}^{(t)})\right]\left[\sum_{w=1}^{W}(n_{\backslash i,w|k}^{(t)}+\beta_{k,w}^{(t)})+1\right]}.$$

$$\gamma_{k}^{(t+1)}=\gamma_{k}^{(t)}+\lambda\,n_{k}^{(t)},$$

$$\beta_{k,w}^{(t+1)}=\beta_{k,w}^{(t)}+\lambda\,n_{w|k}^{(t)},$$

$$z_{i,k}\propto(N_{\backslash i,k}+\alpha)\frac{(N_{\backslash i,w_{i1}|k}+\beta)(N_{\backslash i,w_{i2}|k}+\beta)}{(2N_{\backslash i,k}+W\beta)(2N_{\backslash i,k}+W\beta+1)},$$

$$\hat{N}_{k}=|B|\,z_{i,k},$$

$$\hat{N}_{w|k}=\begin{cases}|B|\,z_{i,k}&\text{if }w\in b_{i},\\0&\text{otherwise}.\end{cases}$$

$$N_{k}\leftarrow(1-\rho_{t})N_{k}+\rho_{t}\hat{N}_{k},$$

$$N_{w|k}\leftarrow(1-\rho_{t})N_{w|k}+\rho_{t}\hat{N}_{w|k},$$

$$\theta_{k}\propto N_{k}+\alpha,$$

$$\phi_{k,w}\propto N_{w|k}+\beta.$$

$$D_{\alpha}[p\,\|\,q]=\frac{\int\alpha p(x)+(1-\alpha)q(x)-p(x)^{\alpha}q(x)^{1-\alpha}\,dx}{\alpha(1-\alpha)}.$$

$$D_{-1}[p\,\|\,q]=\frac{1}{2}\int\frac{(q(x)-p(x))^{2}}{p(x)}\,dx,$$

$$\lim_{\alpha\to 0}D_{\alpha}[p\,\|\,q]=\mathrm{KL}[q\,\|\,p],$$

$$\lim_{\alpha\to 1}D_{\alpha}[p\,\|\,q]=\mathrm{KL}[p\,\|\,q].$$

$$\operatorname*{argmin}_{q(x_{i})}D_{\alpha}[p(x_{i}\mid x_{\backslash i})\,q(x_{\backslash i})\,\|\,q(x)],$$

$$q(x_{i})\propto\mathbb{E}_{q(x_{\backslash i})}\left[\left(\frac{p(x)}{q(x_{\backslash i})}\right)^{\alpha}\right]^{\frac{1}{\alpha}}.$$

$$q(x_{i})\propto\mathbb{E}_{q(x_{\backslash i})}\left[\left(p(x_{i}\mid x_{\backslash i})\frac{p(x_{\backslash i})}{q(x_{\backslash i})}\right)^{\alpha}\right]^{\frac{1}{\alpha}}.$$

$$q(x_{i})\propto\mathbb{E}_{q(x_{\backslash i})}\left[\left(p(x_{i}\mid x_{\backslash i})\right)^{\alpha}\right]^{\frac{1}{\alpha}}.$$

$$q(B,z)=\prod_{i=1}^{N_{B}}q(b_{i},z_{i}).$$

$$q^{*}(B,z)=\operatorname*{argmin}_{q(B,z)}D_{\alpha}[p(B,z)\,\|\,q(B,z)],$$

$$p(b_{i},z_{i}=k)\propto(n_{\backslash i,k}+\gamma)\frac{(n_{\backslash i,w_{i1}|k}+\beta)(n_{\backslash i,w_{i2}|k}+\beta)}{(n_{\backslash i,\cdot|k}+W\beta)(n_{\backslash i,\cdot|k}+W\beta+1)},$$

$$q(b_{i},z_{i}=k)\propto\frac{a_{k}^{\backslash i}\,b_{k,w_{i1}}^{\backslash i}\,b_{k,w_{i2}}^{\backslash i}}{c_{k}^{\backslash i}(c_{k}^{\backslash i}+1)},$$

$$a_{k}^{\backslash i}=\tilde{n}_{\backslash i,k}+\gamma,$$

$$b_{k,w}^{\backslash i}=\tilde{n}_{\backslash i,w|k}+\beta,$$

$$c_{k}^{\backslash i}=\tilde{n}_{\backslash i,\cdot|k}+W\beta.$$

$$q^{\backslash a}(b_{i},z_{i}=k)=\frac{b_{k,w_{i1}}^{\backslash i}\,b_{k,w_{i2}}^{\backslash i}}{c_{k}^{\backslash i}(c_{k}^{\backslash i}+1)},$$


Full text

Zhenghang Cui (1), Issei Sato (1, 2), Masashi Sugiyama (2, 1)

(1) The University of Tokyo, Japan

(2) RIKEN, Japan

Abstract

With the emergence and thriving development of social networks, a huge number of short texts have accumulated and need to be processed. Inferring the latent topics of collected short texts is useful for understanding their hidden structure and predicting new content. Unlike conventional topic models such as latent Dirichlet allocation (LDA), the biterm topic model (BTM) was recently proposed for short texts to overcome the sparseness of document-level word co-occurrences by directly modeling the generation process of word pairs. Stochastic inference algorithms based on collapsed Gibbs sampling (CGS) and collapsed variational inference have been proposed for BTM. However, they either have high computational complexity or rely on very crude estimation. In this work, we develop a stochastic divergence minimization inference algorithm for BTM to estimate latent topics more accurately in a scalable way. Experiments demonstrate the superiority of our proposed algorithm compared with existing inference algorithms.

Keywords:

Short text, topic model, biterm, stochastic inference algorithm

Journal: Machine Learning

1 Introduction

As social network services have become dominant in people’s daily lives, a huge amount of short text data has accumulated. Other data found on traditional web pages, such as article titles or public forum comments, share the same attribute of short length. Exploring their inner structure is an essential and interesting task for a wide range of applications, such as content-based classification or prediction of future documents that have not emerged yet. Because of the document-level word co-occurrence sparsity caused by short document length, conventional topic models such as probabilistic latent semantic indexing (pLSA) (Hofmann, 1999) or latent Dirichlet allocation (LDA) (Blei et al, 2003) fail to show favorable inference performance on data sets consisting of short texts. A biterm topic model (BTM) (Cheng et al, 2014) was proposed to alleviate this problem. Instead of each single word, BTM models the generation process of each unordered combination of two words, or biterm. Each biterm is assumed to be assigned one topic. Compared to conventional topic models, this modification makes BTM less sensitive to the shortness of each document and reveals the relationship between words more stably. By modeling word co-occurrences explicitly and combining words into biterms, BTM has been shown experimentally (Cheng et al, 2014) to alleviate the problem caused by document-level word co-occurrence sparsity while keeping its generality and flexibility.

To infer model parameters and estimate latent topics for BTM, a batch inference algorithm based on collapsed Gibbs sampling (CGS) was first proposed together with the model (Cheng et al, 2014) to approximate the true posterior distribution of parameters. Based on this batch CGS inference algorithm, two online algorithms were proposed (Cheng et al, 2014) to scale up to large data sets. One online algorithm updates hyperparameters between time slices, inspired by the online LDA algorithm (AlSumait et al, 2008). The other resamples the topics of observed biterms a sufficient number of times after each new biterm is observed, inspired by an incremental Gibbs sampler for LDA (Canini et al, 2009). In addition, based on zero-order stochastic collapsed variational Bayesian inference (SCVB0) for LDA (Foulds et al, 2013), a similar SCVB0 algorithm for BTM was proposed for better estimation of latent topics (Awaya et al, 2016). However, these online algorithms either use memory inefficiently or rely on very crude estimation.

In this paper, we propose a stochastic divergence minimization (SDM) inference algorithm for BTM based on minimizing the $\alpha$-divergence to estimate latent topics more accurately. First, inspired by the work for LDA (Sato and Nakagawa, 2012), we reconstruct collapsed variational Bayesian inference which uses only the zero-order Taylor series approximation (CVB0) as an optimization problem of $\alpha$-divergence minimization. Then, we apply a stochastic approximation method to this optimization problem to develop a stochastic inference algorithm.

For a general probabilistic model, CGS inference algorithms try to find a posterior distribution, while variational Bayesian (VB) inference algorithms try to find the closest distribution within a function family (Beal, 2003). The closeness is usually measured by the KL-divergence. VB transforms the original inference problem into an optimization problem, which can be solved by a simple gradient descent algorithm. Similarly to CGS, collapsed variational Bayesian (CVB) inference marginalizes out unconcerned parameters and infers only latent parameters. For example, CVB for BTM (Awaya et al, 2016) marginalizes out the model parameters, namely the vector indicating the topic proportion and the matrix indicating the word distribution for each topic, and calculates the posterior distribution only for the latent parameters indicating the topic assignments of biterms. CGS algorithms usually converge more slowly and are strongly influenced by the initial state of parameters due to the inherent characteristics of Markov chain Monte Carlo sampling. On the other hand, CVB is a deterministic algorithm. Empirically, it converges faster and performs better (Asuncion et al, 2009).

Since exact evaluation of the expectations in the CVB formula is intractable, using only the zero-order term of its Taylor series as a rough approximation is appealing. This results in the zero-order CVB inference algorithm (CVB0) proposed for BTM (Awaya et al, 2016). Based on CVB0, stochastic approximations were developed to scale up the algorithm for huge data sets (Awaya et al, 2016). However, the reason why the zero-order approximation is used instead of higher-order approximations is not clearly explained. Furthermore, although the SCVB0 algorithm for BTM utilizes a scale coefficient to reduce the computational complexity of each iteration from $\mathcal{O}(W)$ to $\mathcal{O}(1)$, where $W$ denotes the size of the vocabulary, the risk of arithmetic underflow in floating-point calculations always exists when processing very large data sets. It also utilizes a very crude approximation of essential statistics at each iteration.

Contributions

Considering the issues discussed above, we propose a novel SDM inference algorithm for BTM. We have three main contributions listed as follows.

• We provide a novel formulation of SCVB0 inference for BTM from the perspective of $\alpha$-divergence minimization. This provides a new means to understand the inner attributes of SCVB0 inference for BTM. This is inspired by similar work developed for LDA (Sato and Nakagawa, 2012).

• We derive an SDM algorithm for BTM based on the $\alpha$-divergence minimization formulation of SCVB0. SDM for BTM is a one-pass algorithm, which means it processes each biterm only once and stops when all biterms have been processed. Compared to SCVB0 for BTM, SDM for BTM requires the same amount of memory and has the same computational complexity for processing a single biterm. On the other hand, SCVB0 does not preserve the sufficient statistics for the counts of each word and has the risk of arithmetic underflow in floating-point calculations. SDM for BTM does not have these problems and provides a better approximation. Experiments reveal that SDM for BTM can estimate latent topics more accurately, and thus can predict documents which have not emerged yet with higher accuracy than existing methods.

• We analyze the convergence of our proposed method by using martingale convergence theory.

The remainder of this paper is organized as follows. In Section 2, we introduce the related works on BTM, its existing inference algorithms, and the theoretical background for SDM. In Section 3, we introduce our proposed SDM algorithm. In Section 4, we conduct experiments to evaluate our proposed method against existing methods and discuss the result. In Section 5, we conclude this paper.

2 Related Works

In this section, we will introduce the biterm topic model (BTM) (Cheng et al, 2014), followed by its batch and online inference algorithms. Essential information for $\alpha$-divergence is presented at the end of this section.

2.1 BTM

Conventional topic models such as LDA usually fail to show satisfactory performance on short text data sets. BTM was proposed to alleviate this problem by modifying the word generating part of the graphical model. Instead of modeling the generation of each word, BTM directly models the generation of biterms, which are unordered combinations of two words. For example, a document of $n$ words will generate $\binom{n}{2}$ combinations of two words. Compared to conventional topic models, this modification makes BTM less sensitive to the short length of each document, and biterms reveal the relationship between words more stably. Based on the original paper (Cheng et al, 2014), the notation is listed as follows.
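As a small illustration of the $\binom{n}{2}$ count mentioned above, the following sketch enumerates the biterms of one tokenized document. The function name `extract_biterms` and the example tokens are assumptions made for illustration and do not come from the original paper.

```python
from itertools import combinations


def extract_biterms(tokens):
    """Return every unordered word pair (biterm) of one short document."""
    # combinations() visits each pair of positions once, so n tokens
    # yield n * (n - 1) / 2 biterms; sorting makes each pair order-free.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


doc = ["apple", "store", "iphone", "release"]
biterms = extract_biterms(doc)
print(len(biterms), biterms[0])
```

A document with a single word yields an empty list, which is why such documents carry no co-occurrence information for BTM.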

• A data set contains $N_{B}$ biterms, where each biterm is denoted by $b_{i}=\{w_{i1},w_{i2}\}$.

• The number of topics is denoted by $K$.

• The size of the vocabulary is denoted by $W$.

• A topic proportion vector is denoted by $\theta$. Its length is $K$ and all of its entries sum to $1$.

• A word distribution matrix is denoted by $\Phi$. Its size is $K\times W$. Each row vector $\phi_{k}$ has length $W$ and sums to $1$.

• A topic indicator variable for biterm $b_{i}$ is denoted by $z_{i}$. It has a length of $K$ and all of its entries sum to $1$.

The generative process is described formally as follows.

1. Draw $\theta\sim$ Dirichlet($\gamma$)

2. For each topic $k$

(a) Draw $\phi_{k}\sim$ Dirichlet($\beta$)

3. For each biterm $b_{i}$

(a) Draw $z_{i}\sim$ Multinomial($\theta$)

(b) Draw $w_{i1},w_{i2}\sim$ Multinomial($\phi_{z_{i}}$)

Here, Dirichlet($\gamma$) denotes a Dirichlet distribution with parameter $\gamma$, and Multinomial($\theta$) denotes a multinomial distribution with parameter $\theta$. The graphical model of BTM is shown in Fig. 1.
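The generative process can be sketched with NumPy as follows. This is a minimal illustration assuming symmetric Dirichlet priors and integer word ids; the function name `sample_btm`, the seed, and the toy sizes are assumptions and not part of the original paper.

```python
import numpy as np


def sample_btm(num_biterms, K, W, gamma, beta, seed=0):
    """Sample a synthetic biterm set by following the generative process."""
    rng = np.random.default_rng(seed)
    theta = rng.dirichlet(np.full(K, gamma))        # step 1: topic proportion
    phi = rng.dirichlet(np.full(W, beta), size=K)   # step 2: K x W word distributions
    z = rng.choice(K, size=num_biterms, p=theta)    # step 3(a): one topic per biterm
    biterms = np.empty((num_biterms, 2), dtype=np.int64)
    for i, k in enumerate(z):
        # step 3(b): both words come from the same topic's word distribution
        biterms[i] = rng.choice(W, size=2, p=phi[k])
    return theta, phi, z, biterms


theta, phi, z, biterms = sample_btm(num_biterms=1000, K=3, W=20, gamma=1.0, beta=0.5)
```

Note that the two words of a biterm share a single topic, which is the point where BTM departs from per-word topic assignment in LDA.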

Following the generation process, we can express the likelihood of a data set $B$ conditioned on model parameters $\theta$ and $\Phi$ as

$$P(B\mid\theta,\Phi)=\prod_{i=1}^{N_{B}}\sum_{k=1}^{K}\theta_{k}\,\phi_{k,w_{i1}}\,\phi_{k,w_{i2}}.$$

2.2 Batch Inference Algorithm

Here, we concisely introduce the batch inference algorithm, which estimates all three parameters: $z$, which indicates the topic assignments; $\theta$, which indicates the topic proportion; and $\Phi$, which indicates the word distribution for each topic. Since it is intractable to compute the exact posterior distributions of these parameters, the CGS algorithm is used to approximate the true posterior distributions (Cheng et al, 2014). Parameters $\theta$ and $\Phi$ are first integrated out using conjugate priors, then $z_{i}$ for each biterm $b_{i}$ is sampled using the posterior distribution conditioned on all of the other variables. After processing all biterms, $\theta$ and $\Phi$ can be restored using $z$. However, this can be a computational burden when the given data set is large, which motivates the development of stochastic inference algorithms that will be discussed in Section 2.3 and Section 2.4. The following formula is used to sample $z_{i}$ for each biterm $b_{i}$:

$$P(z_{i}=k\mid z_{\backslash i},B)\propto(n_{\backslash i,k}+\gamma)\frac{(n_{\backslash i,w_{i1}|k}+\beta)(n_{\backslash i,w_{i2}|k}+\beta)}{(n_{\backslash i,\cdot|k}+W\beta)(n_{\backslash i,\cdot|k}+W\beta+1)}.$$

Let $z_{\backslash i}$ be the whole topic assignment vector without considering $b_{i}$, $n_{\backslash i,k}$ be the count of biterms assigned to topic $k$ without counting $b_{i}$, and $n_{\backslash i,w|k}$ be the number of times that word $w$ is assigned to topic $k$ without counting $b_{i}$. The dot in $n_{\backslash i,\cdot|k}$ means taking the sum over all words. After a sufficient number of iterations over the whole data set, we can restore $\theta$ and $\Phi$ using the following formulas:

$$\phi_{k,w}=\frac{n_{w|k}+\beta}{n_{\cdot|k}+W\beta},$$

$$\theta_{k}=\frac{n_{k}+\gamma}{N_{B}+K\gamma},$$

where $n_{k}$ is the number of biterms assigned to topic $k$ and $n_{w|k}$ is the number of times that word $w$ is assigned to topic $k$. The dot in $n_{\cdot|k}$ means taking the sum over all words.
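The sampling and restoration steps described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the reference implementation: the helper names (`init_counts`, `cgs_sweep`, `restore_parameters`), the toy data, and the number of sweeps are all hypothetical. It uses the fact that each biterm contributes two words to its topic, so $n_{\cdot|k}=2n_{k}$.

```python
import numpy as np


def init_counts(biterms, K, W, rng):
    """Assign random topics and build the count statistics n_k and n_{w|k}."""
    z = rng.integers(K, size=len(biterms))
    n_k = np.zeros(K, dtype=np.int64)
    n_wk = np.zeros((K, W), dtype=np.int64)
    for (w1, w2), k in zip(biterms, z):
        n_k[k] += 1
        n_wk[k, w1] += 1
        n_wk[k, w2] += 1
    return z, n_k, n_wk


def cgs_sweep(biterms, z, n_k, n_wk, gamma, beta, rng):
    """One collapsed Gibbs sweep over all biterms, updating counts in place."""
    K, W = n_wk.shape
    for i, (w1, w2) in enumerate(biterms):
        k_old = z[i]
        # Remove biterm i from the counts to obtain the "\i" statistics.
        n_k[k_old] -= 1
        n_wk[k_old, w1] -= 1
        n_wk[k_old, w2] -= 1
        n_dot = 2 * n_k  # each biterm contributes two words to its topic
        p = (n_k + gamma) * (n_wk[:, w1] + beta) * (n_wk[:, w2] + beta)
        p = p / ((n_dot + W * beta) * (n_dot + W * beta + 1))
        k_new = rng.choice(K, p=p / p.sum())
        z[i] = k_new
        n_k[k_new] += 1
        n_wk[k_new, w1] += 1
        n_wk[k_new, w2] += 1


def restore_parameters(n_k, n_wk, gamma, beta):
    """Recover theta and Phi from the final counts."""
    K, W = n_wk.shape
    phi = (n_wk + beta) / (n_wk.sum(axis=1, keepdims=True) + W * beta)
    theta = (n_k + gamma) / (n_k.sum() + K * gamma)
    return theta, phi


rng = np.random.default_rng(0)
biterms = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)] * 20  # toy data
K, W = 2, 6
z, n_k, n_wk = init_counts(biterms, K, W, rng)
for _ in range(50):
    cgs_sweep(biterms, z, n_k, n_wk, gamma=1.0, beta=0.01, rng=rng)
theta, phi = restore_parameters(n_k, n_wk, gamma=1.0, beta=0.01)
```

Every sweep touches all biterms, which illustrates why the batch algorithm becomes a burden on large data sets.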

2.3 Online BTM Algorithm

In recent real-world inference problems, the size of the data to analyze is usually very large and keeps increasing. To deal with such large data, it would be useful to develop algorithms that can handle data in streaming form. In the original paper (Cheng et al, 2014), two kinds of algorithms were introduced to deal with very large data sets. The online BTM algorithm will be introduced here and the incremental BTM algorithm will be introduced in Section 2.4.

The idea of the online BTM algorithm is inspired by the similar algorithm proposed for LDA (AlSumait et al, 2008). The data set is supposed to be separated in multiple time-slices, e.g., hourly, daily or weekly. Within the processing of a single time-slice sample, hyperparameters $\gamma$ and $\beta$ are updated using statistics of data in this time slice. After a sufficient number of iterations, parameters $\theta$ and $\Phi$ can be restored to reflect the influence of this time slice.

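The notation is as follows. The biterm set of time $t$ is denoted by $B^{(t)}$. The number of biterms assigned to topic $k$ within $B^{(t)}$ is denoted by $n_{k}^{(t)}$. The number of times word $w$ is assigned to topic $k$ within $B^{(t)}$ is denoted by $n_{w|k}^{(t)}$. Hyperparameters for $\theta$ are denoted by the vector $\{\gamma_{1},\ldots,\gamma_{K}\}$ and hyperparameters for $\Phi$ are denoted by the matrix $\{\beta_{1},\ldots,\beta_{K}\}$, where $\beta_{k}$ is a vector consisting of $\{\beta_{k,1},\ldots,\beta_{k,W}\}$. The conditional distribution for sampling each topic $z_{i}$ is given by

$$P(z_{i}=k\mid z_{\backslash i}^{(t)},B^{(t)},\gamma^{(t)},\beta^{(t)})\propto(n_{\backslash i,k}^{(t)}+\gamma_{k}^{(t)})\frac{(n_{\backslash i,w_{i1}|k}^{(t)}+\beta_{k,w_{i1}}^{(t)})(n_{\backslash i,w_{i2}|k}^{(t)}+\beta_{k,w_{i2}}^{(t)})}{\left[\sum_{w=1}^{W}(n_{\backslash i,w|k}^{(t)}+\beta_{k,w}^{(t)})\right]\left[\sum_{w=1}^{W}(n_{\backslash i,w|k}^{(t)}+\beta_{k,w}^{(t)})+1\right]}.$$

After the processing of each time-slice sample, the hyperparameters can be updated as

$$\gamma_{k}^{(t+1)}=\gamma_{k}^{(t)}+\lambda\,n_{k}^{(t)},$$

$$\beta_{k,w}^{(t+1)}=\beta_{k,w}^{(t)}+\lambda\,n_{w|k}^{(t)},$$

where the decay weight is denoted by $\lambda\in[0,1]$. It controls the dependence on data in past time slices. The details of the procedure are described in Alg. 1.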
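The hyperparameter update between time slices is simple enough to sketch directly. The function name `update_hyperparameters` and the toy counts below are assumptions for illustration only.

```python
import numpy as np


def update_hyperparameters(gamma, beta, n_k, n_wk, lam):
    """Carry the statistics of time slice t into the priors of slice t + 1."""
    # lam = 0 forgets the slice entirely; lam = 1 keeps its counts at full weight.
    return gamma + lam * n_k, beta + lam * n_wk


K, W = 2, 3
gamma = np.full(K, 0.5)
beta = np.full((K, W), 0.01)
n_k = np.array([4.0, 1.0])                     # biterms per topic in slice t
n_wk = np.array([[5.0, 3.0, 0.0],
                 [0.0, 1.0, 1.0]])             # word counts per topic in slice t
gamma_next, beta_next = update_hyperparameters(gamma, beta, n_k, n_wk, lam=0.5)
```

Because the counts of a slice are folded into the priors, the raw biterms of earlier slices do not need to be revisited when the next slice is processed.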

2.4 Incremental BTM Algorithm

Although the online BTM algorithm can be adapted to sequential data, updating parameters immediately after a biterm arrives may be essential in some situations. The incremental BTM algorithm was proposed for this purpose. It can update parameters after the arrival of each single biterm.

The idea of the incremental BTM algorithm is inspired by the incremental Gibbs sampler (Canini et al, 2009). Specifically, after a new biterm arrives and its topic has been sampled, a biterm sequence called a rejuvenation sequence is constructed on the fly, and the topics of all biterms belonging to this sequence are resampled. The length and the choice of the rejuvenation sequence influence the performance profoundly. For convenience, the sequence length is regarded as a hyperparameter and the uniform distribution is used to generate the sequence.

The details of the procedure are described in Alg. 2.

2.5 SCVB0 Algorithm for BTM

The batch algorithm, CVB0 for BTM, will be introduced first, followed by its stochastic formulation, SCVB0 for BTM.

CVB0 for BTM (Awaya et al, 2016) is inspired by CVB0 for LDA (Asuncion et al, 2009). Similarly to CGS, global parameters $\theta$ and $\Phi$ are first marginalized out and only inference for the latent parameter $z$ is performed. A zero-order Taylor series approximation is utilized because some expectations are intractable to evaluate. The updating formula for the variational parameter $z_{i,k}$ can be derived as

$$z_{i,k}\propto(N_{\backslash i,k}+\alpha)\frac{(N_{\backslash i,w_{i1}|k}+\beta)(N_{\backslash i,w_{i2}|k}+\beta)}{(2N_{\backslash i,k}+W\beta)(2N_{\backslash i,k}+W\beta+1)},$$

where $N_{k}=\sum_{b_{i}\in B}z_{i,k}$, $N_{w|k}=\sum_{b_{i}\in B_{w}}z_{i,k}$, $B_{w}$ denotes the set of biterms containing word $w$, and $\backslash i$ means counting without considering $b_{i}$.

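SCVB0 for BTM is based on the idea of ignoring the subtraction of the current biterm and updating statistics in a stochastic way. Storing all variational parameters is not necessary, and a very crude estimate of $N_{k}$ and $N_{w|k}$ when a biterm $b_{i}$ is observed can be expressed by

$$\hat{N}_{k}=|B|\,z_{i,k},$$

$$\hat{N}_{w|k}=\begin{cases}|B|\,z_{i,k}&\text{if }w\in b_{i},\\0&\text{otherwise}.\end{cases}$$

Then, $N_{k}$ and $N_{w|k}$ can be updated using the following formulas:

$$N_{k}\leftarrow(1-\rho_{t})N_{k}+\rho_{t}\hat{N}_{k},$$

$$N_{w|k}\leftarrow(1-\rho_{t})N_{w|k}+\rho_{t}\hat{N}_{w|k},$$

where $\rho_{t}=1/(t+\tau)^{\kappa}$ denotes the step size.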

To reduce the computational complexity of each update from $\mathcal{O}(W)$ to $\mathcal{O}(1)$, the following technique is used to represent the value of $N_{w|k}$. A scaling coefficient $a$ and a dummy matrix $A_{w|k}$ are stored such that $N_{w|k}=a\,A_{w|k}$. Every time $N_{w|k}$ is updated, one just needs to multiply $a$ by $(1-\rho_{t})$ and explicitly compute the values of $A_{w_{i1}|k}$ and $A_{w_{i2}|k}$. This manipulation significantly reduces the computational complexity, but bears the risk of underflow in $a$, because $(1-\rho_{t})$ is multiplied repeatedly during the algorithm.
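The scaling technique can be checked numerically for a single topic row. The sketch below compares a dense update of $N_{w|k}$ with the factored form $N_{w|k}=a\,A_{w|k}$. The function names, the toy schedule $\rho_{t}=1/(t+10)^{0.8}$, and the constant estimate `n_hat` are assumptions for illustration.

```python
import numpy as np


def dense_update(N, words, n_hat, rho):
    """Reference update that rescales all W entries: O(W) per step."""
    N = (1.0 - rho) * N
    N[words] += rho * n_hat
    return N


def scaled_update(a, A, words, n_hat, rho):
    """Same update on the factored form N = a * A, touching two entries only."""
    a = a * (1.0 - rho)            # implicit decay of every entry at once
    A[words] += rho * n_hat / a    # compensate only the words of the biterm
    return a, A


W = 5
N = np.ones(W)
a, A = 1.0, np.ones(W)
for t, words in enumerate([[0, 1], [2, 3], [0, 4]], start=1):
    rho = 1.0 / (t + 10.0) ** 0.8
    N = dense_update(N, words, n_hat=3.0, rho=rho)
    a, A = scaled_update(a, A, words, n_hat=3.0, rho=rho)
```

The coefficient `a` is a running product of factors below one, so it shrinks monotonically over the stream while the touched entries of `A` grow to compensate; this is the source of the underflow risk described above.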

After processing all of the biterms, global parameters can be restored using the following formulas:

$$\theta_{k}\propto N_{k}+\alpha,$$

$$\phi_{k,w}\propto N_{w|k}+\beta.$$

The details of the procedure are described in Alg. 3.
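A single SCVB0 step as described in this section can be sketched as follows. This is a dense, unoptimized illustration that deliberately omits the scaling technique, so each step costs $\mathcal{O}(KW)$ here. The names `step_size` and `scvb0_step`, the toy data, and the initialization are assumptions, and `alpha` denotes the topic prior used in the CVB0 update rather than the divergence index.

```python
import numpy as np


def step_size(t, tau, kappa):
    """Step size rho_t = 1 / (t + tau)^kappa."""
    return 1.0 / (t + tau) ** kappa


def scvb0_step(w1, w2, N_k, N_wk, alpha, beta, num_biterms, rho):
    """Process one biterm: compute z_{i,k}, then blend in the crude estimates."""
    K, W = N_wk.shape
    # The subtraction of the current biterm is ignored, as described in the text.
    z = (N_k + alpha) * (N_wk[:, w1] + beta) * (N_wk[:, w2] + beta)
    z = z / ((2 * N_k + W * beta) * (2 * N_k + W * beta + 1))
    z = z / z.sum()
    # One-sample estimates: scale the single biterm up to the whole data set.
    N_hat_k = num_biterms * z
    N_hat_wk = np.zeros_like(N_wk)
    N_hat_wk[:, w1] = num_biterms * z
    N_hat_wk[:, w2] = num_biterms * z
    N_k[:] = (1.0 - rho) * N_k + rho * N_hat_k
    N_wk[:] = (1.0 - rho) * N_wk + rho * N_hat_wk
    return z


rng = np.random.default_rng(0)
K, W, num_biterms = 2, 6, 120
biterms = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)] * 20  # toy data
N_k = np.full(K, num_biterms / K)
N_wk = rng.random((K, W))
N_wk *= (2 * N_k / N_wk.sum(axis=1))[:, None]   # rows start at 2 * N_k
for t, i in enumerate(rng.permutation(num_biterms), start=1):
    w1, w2 = biterms[i]
    scvb0_step(w1, w2, N_k, N_wk, alpha=1.0, beta=0.01,
               num_biterms=num_biterms, rho=step_size(t, tau=10.0, kappa=0.8))
theta = (N_k + 1.0) / (N_k + 1.0).sum()
phi = (N_wk + 0.01) / (N_wk + 0.01).sum(axis=1, keepdims=True)
```

No per-biterm variational parameters are stored; only `N_k` and `N_wk` persist between steps, which matches the memory cost listed for SCVB0-BTM in Table 1.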

2.6 $\alpha$-divergence

Here we briefly introduce the concepts of $\alpha$-divergence and local divergence projection inference. More details can be found in Amari (1990) and Minka (2005).

Definition

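The $\alpha$-divergence can be perceived as a generalized KL divergence. We state its definition using two distributions $p(x)$ and $q(x)$. The $\alpha$-divergence from $p(x)$ to $q(x)$, indexed by $\alpha\in(-\infty,\infty)$, is defined as

$$D_{\alpha}[p\,\|\,q]=\frac{\int\alpha p(x)+(1-\alpha)q(x)-p(x)^{\alpha}q(x)^{1-\alpha}\,dx}{\alpha(1-\alpha)}.$$

Notice that $p(x)$ and $q(x)$ need not be normalized before calculating the $\alpha$-divergence. Some useful special cases of $\alpha$ are:

$$D_{-1}[p\,\|\,q]=\frac{1}{2}\int\frac{(q(x)-p(x))^{2}}{p(x)}\,dx,$$

$$\lim_{\alpha\to 0}D_{\alpha}[p\,\|\,q]=\mathrm{KL}[q\,\|\,p],$$

$$\lim_{\alpha\to 1}D_{\alpha}[p\,\|\,q]=\mathrm{KL}[p\,\|\,q].$$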
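The special cases can be checked numerically on discrete distributions, where the integral becomes a sum. The sketch below is an illustration only; the function names and the example vectors `p` and `q` are assumptions, and the limits are approximated by evaluating at $\alpha$ close to 0 and 1.

```python
import numpy as np


def alpha_divergence(p, q, alpha):
    """Discrete alpha-divergence D_alpha[p || q]; alpha must not be 0 or 1."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    integrand = alpha * p + (1.0 - alpha) * q - p**alpha * q**(1.0 - alpha)
    return integrand.sum() / (alpha * (1.0 - alpha))


def kl(p, q):
    """KL[p || q] with the correction terms that handle unnormalized inputs."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q) - p + q)


p = [0.5, 0.3, 0.2]
q = [0.2, 0.5, 0.3]
eps = 1e-6
near_one = alpha_divergence(p, q, 1.0 - eps)    # approaches KL[p || q]
near_zero = alpha_divergence(p, q, eps)         # approaches KL[q || p]
chi2 = alpha_divergence(p, q, -1.0)             # 0.5 * sum((q - p)^2 / p)
```

The index $\alpha$ therefore interpolates between the two directions of the KL divergence, which is what allows different values of $\alpha$ to be used for different factors later in the derivation.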

Local $\alpha$-divergence projection

Suppose that the approximating distribution $q(x)$ can be fully factorized, that is, $q(x)=\prod_{i=1}^{n}q(x_{i})$, where $x_{i}$ denotes the $i$-th element of the vector $x=(x_{1},x_{2},\ldots,x_{n})^{\top}$ and $\top$ denotes the transpose. Depending on $p(x)$, it can be intractable to naively compute the $\alpha$-divergence. To avoid this problem, we focus on each single $x_{i}$ and then optimize each $q(x_{i})$ by

$$\operatorname*{argmin}_{q(x_{i})}D_{\alpha}[p(x_{i}\mid x_{\backslash i})\,q(x_{\backslash i})\,\|\,q(x)],$$

where $x_{\backslash i}$ represents all but the $i$-th entry of $x$ and $q(x)=q(x|x_{\backslash i})q(x_{\backslash i})$. Its update formula can be obtained by taking the derivative of the $\alpha$-divergence and equating it to zero:

$$q(x_{i})\propto\mathbb{E}_{q(x_{\backslash i})}\left[\left(\frac{p(x)}{q(x_{\backslash i})}\right)^{\alpha}\right]^{\frac{1}{\alpha}}.$$

Since the expectation over $q(x_{\backslash i})$ is computationally intractable, we approximate it by first splitting the numerator as

$$q(x_{i})\propto\mathbb{E}_{q(x_{\backslash i})}\left[\left(p(x_{i}\mid x_{\backslash i})\frac{p(x_{\backslash i})}{q(x_{\backslash i})}\right)^{\alpha}\right]^{\frac{1}{\alpha}}.$$

We then substitute $p(x_{\backslash i})$ with $q(x_{\backslash i})$ to obtain the following approximation:

$$q(x_{i})\propto\mathbb{E}_{q(x_{\backslash i})}\left[\left(p(x_{i}\mid x_{\backslash i})\right)^{\alpha}\right]^{\frac{1}{\alpha}}.$$

This is the method we will use in the derivation of a stochastic divergence minimization algorithm to approximate the $\alpha$-divergence.

3 Proposed Method

In this section, we propose a novel SDM inference algorithm for BTM. We will first show the derivation of SDM. Then we will show its relation to the leave-one-out likelihood (LOO).

3.1 Derivation of Divergence Minimization

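We assume independence between the latent topics of the given biterms:

$$q(B,z)=\prod_{i=1}^{N_{B}}q(b_{i},z_{i}).$$

We then estimate this distribution by $\alpha$-divergence minimization:

$$q^{*}(B,z)=\operatorname*{argmin}_{q(B,z)}D_{\alpha}[p(B,z)\,\|\,q(B,z)],$$

where $q^{*}(B,z)$ denotes the optimized distribution.

Since it is intractable to compute this minimization, we consider the following local divergence minimization. First, note that

$$p(b_{i},z_{i}=k)\propto(n_{\backslash i,k}+\gamma)\frac{(n_{\backslash i,w_{i1}|k}+\beta)(n_{\backslash i,w_{i2}|k}+\beta)}{(n_{\backslash i,\cdot|k}+W\beta)(n_{\backslash i,\cdot|k}+W\beta+1)},$$

where the notation is the same as in Eq. (2) for CGS.

We then reparameterize $q(b_{i},z_{i})$ as follows:

$$q(b_{i},z_{i}=k)\propto\frac{a_{k}^{\backslash i}\,b_{k,w_{i1}}^{\backslash i}\,b_{k,w_{i2}}^{\backslash i}}{c_{k}^{\backslash i}(c_{k}^{\backslash i}+1)},$$

$$a_{k}^{\backslash i}=\tilde{n}_{\backslash i,k}+\gamma,$$

$$b_{k,w}^{\backslash i}=\tilde{n}_{\backslash i,w|k}+\beta,$$

$$c_{k}^{\backslash i}=\tilde{n}_{\backslash i,\cdot|k}+W\beta.$$

Notice that $\tilde{n}_{\backslash i,k}$, $\tilde{n}_{\backslash i,w|k}$ and $\tilde{n}_{\backslash i,\cdot|k}$ here are not counts. They are just parameters of the function defined above.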

We also define

[TABLE]

Recalling that the $\alpha$-divergence does not need the distributions to be normalized, we can define the following local projections:

[TABLE]

where

[TABLE]

Taking the derivative of Eq. (34) with respect to $a^{\backslash i}$ and equating it to zero yields

[TABLE]

With $\sum_{z_{\backslash i}}q(z_{\backslash i})=1$ , we can obtain

[TABLE]

Similarly, we can derive the solutions to the other optimization problems listed above as follows:

[TABLE]

If we use $\alpha$-divergence projection with $\alpha=1$ for $a_{k}^{\backslash i}$ and $b_{k,w}^{\backslash i}$, while using it with $\alpha=-1$ for $c_{k}^{\backslash i}$, we can obtain the update formula for $q(z_{i})$ as

[TABLE]

which can also be obtained by SCVB0.

3.2 Relation to LOO Likelihood

Here, we investigate the relationship between divergence minimization and the LOO likelihood.

For the LOO prediction of a new biterm $b_{i}$ , its probability is given by

[TABLE]

On the other hand, using Eq. (46), we get

[TABLE]

This full likelihood is similar to the above LOO likelihood and can be regarded as an approximation to it. Considering the close relation between the LOO likelihood and the full likelihood, lowering the full likelihood indicated by $q(b_{i})$ can result in a lower likelihood of the data set. The close correlation between the LOO perplexity and test perplexity of LDA has been shown with detailed experimental results in Sato and Nakagawa (2015).

3.3 Derivation of SDM

For the three terms we defined in $\alpha$-divergence minimization, we have $\mathbb{E}[n_{\backslash i,\cdot|k}]=2\,\mathbb{E}[n_{\backslash i,k}]$, since each biterm contributes two words to its topic. Therefore, $a_{k}^{\backslash i}$ can be restored from $c_{k}^{\backslash i}$ and we need only to compute the values of $b_{k}^{\backslash i}$. $c_{k}^{\backslash i}$ can be calculated as $c_{k}^{\backslash i}=\sum_{w=1}^{W}b_{k,w}^{\backslash i}$, so we can update it by $c_{k}^{\backslash i}\leftarrow c_{k}^{\backslash i}-(b_{k,w}^{\backslash i})^{(\textrm{old})}+(b_{k,w}^{\backslash i})^{(\textrm{new})}$. For this reason, we will focus on the stochastic approximation for the term $b_{k,w}^{\backslash i}$.

Recall that $b_{k,w}^{\backslash i}=\mathbb{E}[n_{\backslash i,w|k}]+\beta$ . We can rewrite it in the form of fixed point iteration with step size $\rho_{t}$ :

[TABLE]

We can then replace the term $\mathbb{E}[n_{\backslash i,w|k}]$ with approximation $\tilde{n}_{\backslash i,w|k}$ , which is defined by

[TABLE]

where $i^{\prime}\neq i$ is a random sample with $b_{i^{\prime}}$ also containing the word $w$ . We can then substitute it into the update formula:

[TABLE]

This update formula in effect performs updates per vocabulary word. Furthermore, we process all the biterms in a one-pass fashion, which means that each biterm is processed exactly once, one after another, until the end. For this reason, we take the perspective of word-type updates and reformulate the update as

[TABLE]

where $t(w)$ represents how many times the word $w$ has been updated so far and $\rho_{t(w)}=1/(1+t(w))^{\kappa}$. In the context of stochastic approximation, we can simply set $b_{k,w}^{t(w)+1}=(b_{k,w}^{\backslash i})^{(t+1)}$.

The detailed procedure is described in Alg. 4.
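The bookkeeping described in this section can be sketched as follows: a per-word step size $\rho_{t(w)}=1/(1+t(w))^{\kappa}$, a fixed-point style update of the column $b_{k,w}$, and the incremental maintenance of $c_{k}=\sum_{w}b_{k,w}$. The stochastic estimate that drives the update is not reproduced here, so the `target` argument is a hypothetical stand-in for it, and the function name `sdm_word_update` and the toy values are assumptions. This is not Alg. 4 itself.

```python
import numpy as np


def sdm_word_update(b, c, t_w, w, target, kappa):
    """Move column b[:, w] toward `target` and keep c_k = sum_w b[k, w] in sync."""
    rho = 1.0 / (1.0 + t_w[w]) ** kappa     # per-word step size rho_{t(w)}
    b_new = (1.0 - rho) * b[:, w] + rho * target
    c += b_new - b[:, w]                    # c_k <- c_k - b_old + b_new, O(K)
    b[:, w] = b_new
    t_w[w] += 1                             # word w has been updated once more


K, W, beta = 2, 4, 0.01
b = np.full((K, W), beta)
c = b.sum(axis=1)
t_w = np.zeros(W, dtype=np.int64)
# Hypothetical targets standing in for the stochastic estimate plus beta.
for w, target in [(0, [3.0, 1.0]), (2, [0.5, 2.0]), (0, [1.0, 1.0])]:
    sdm_word_update(b, c, t_w, w, np.asarray(target) + beta, kappa=0.51)
```

Each update touches one column of `b` and the vector `c`, so no global rescaling coefficient is needed and the running-product underflow discussed for SCVB0 does not arise in this bookkeeping.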

4 Experiments

We evaluate the effect of our proposed algorithm compared with existing inference algorithms.

4.1 Experimental Settings

We used the data set called Tweets2011 (http://trec.nist.gov/data/tweets) to evaluate the algorithms. Tweets2011 is a standard collection of tweets published between January 23rd and February 8th, 2011.

The raw data of Tweets2011 is very noisy and contains tweets in multiple languages. For many of these languages, such as Japanese, morphological analysis is difficult. Therefore, in the preprocessing stage, we applied a filter to keep only English tweets. Then, we implemented essential preprocessing tasks such as stop word removal and punctuation removal. Finally, we removed documents consisting of only a single word since they cannot form biterms and have no co-occurrence between different words.

After preprocessing, as we can see from the document length distribution shown in Fig. 2, most of the documents have length less than 10. There are around $3.5$ million documents and the average document length is $6.94$ words.

In most real world situations, the size of a given data set is too large to run a batch algorithm. Therefore, to better simulate practical circumstances, we considered the four algorithms, namely, the proposed SDM algorithm, the SCVB0 algorithm for BTM, the online BTM algorithm, and the incremental BTM algorithm, that can scale up to large data sets.

In order to show the convergence speed of each algorithm, and to compare all of the above algorithms at the same scale, we let the programs output the result together with the proportion of processed biterms from the beginning of each experiment. All of the experiments are conducted on an Ubuntu server with 2.9 GHz Intel Xeon E5-2667 CPU and 64 GB memory.

We formed around $95$ million biterms from the $3.5$ million preprocessed short documents. We shuffled all of the biterms and divided them into a training set and a test set with a size ratio of about $4:1$. As a result, the training set contains about $75$ million biterms and the test set about $20$ million.
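Biterm construction and the train/test split can be sketched as follows. This assumes that every unordered pair of word positions in a short document forms a biterm (the whole document is treated as one context window); the function names and the fixed seed are illustrative choices, not details from the paper.

```python
import random
from itertools import combinations


def extract_biterms(doc):
    """Return all unordered word pairs from one tokenized document."""
    return [tuple(sorted(pair)) for pair in combinations(doc, 2)]


def shuffle_and_split(biterms, train_fraction=0.8, seed=0):
    """Shuffle biterms and split them roughly 4:1 into train and test."""
    shuffled = list(biterms)  # copy so the caller's list is untouched
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]
```

A document of length $n$ yields $n(n-1)/2$ biterms under this assumption, so an average length near $7$ is consistent with a biterm count that is well over an order of magnitude larger than the document count.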

For the three existing methods, hyperparameters are set in the same way as in the original paper (Cheng et al, 2014). To ensure that both CGS-based algorithms finish in a realistic time, we set the length of the rejuvenation sequence to $10$ for the incremental BTM algorithm and the number of inner iterations to $10$ for the online BTM algorithm. For all of the algorithms, we set $\gamma$ to $50/K$ and $\beta$ to $0.01$. As step size hyperparameters, $\tau=1000$ and $\kappa=0.8$ are used for SCVB0, while $\kappa=0.51$ is used for SDM.

Cheng et al (2014) chose to evaluate the coherence of observed topics to measure the performance of each algorithm, which requires us to use external data from sources such as Wikipedia for evaluating the pointwise mutual information. This result can be dependent on the size and the quality of the external data. Therefore, we decided to use the strategy called predictive sample re-use (Geisser, 1975) to evaluate the efficiency of each algorithm. This evaluation strategy simply measures the log likelihood sum of the test data set based on the model parameters calculated by each algorithm.
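The evaluation can be sketched as below, assuming the standard BTM biterm likelihood $p(b)=\sum_{k}\theta_{k}\,\phi_{k,w_{1}}\,\phi_{k,w_{2}}$. Here `theta` (topic proportions) and `phi` (per-topic word distributions) are illustrative names for the estimated model parameters, and words are assumed to be integer indices.

```python
import math


def biterm_likelihood(theta, phi, w1, w2):
    """p(b) = sum_k theta[k] * phi[k][w1] * phi[k][w2] for biterm (w1, w2)."""
    return sum(theta[k] * phi[k][w1] * phi[k][w2] for k in range(len(theta)))


def average_log_likelihood(theta, phi, test_biterms):
    """Mean log likelihood over held-out biterms (higher is better)."""
    total = sum(
        math.log(biterm_likelihood(theta, phi, w1, w2))
        for w1, w2 in test_biterms
    )
    return total / len(test_biterms)
```

With smoothed estimates (for example, $\beta>0$), every $\phi_{k,w}$ is strictly positive, so the logarithm is well defined for any test biterm whose words are in the vocabulary.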

4.2 Evaluation

We demonstrate and discuss the performance of the four inference algorithms. Each set of experiments is repeated 10 times.

In Fig. 3, the horizontal axis represents experimental settings under different topic numbers. In Fig. 4, the horizontal axis represents the percentage of documents that have been processed. In both figures, the vertical axis represents the mean of the average log likelihood of the test set over 10 runs, where higher is better; error bars show the standard deviation. SDM-BTM represents the proposed SDM algorithm for BTM. SCVB0-BTM represents the SCVB0 algorithm for BTM. iBTM represents the incremental BTM algorithm. oBTM represents the online BTM algorithm.

The comparison of the update cost, memory and purpose between different algorithms is shown in Tab. 1, where $R$ denotes the length of the rejuvenation sequence and $B_{t}$ denotes the size of a biterm mini-batch.

As shown in Fig. 3 and Fig. 4, compared to online algorithms based on CGS, SDM-BTM shows a higher convergence speed and a better convergence result. This is because SDM-BTM is a deterministic algorithm while CGS algorithms are based on sampling. Compared to SCVB0-BTM, SDM-BTM performs better on a real world data set because SDM-BTM preserves the sufficient statistics correctly and is not exposed to the risk of arithmetic underflow.

5 Conclusions

In this paper, we first reviewed the BTM and its existing inference algorithms. We then reconstructed CVB0 inference for BTM and proposed a novel SDM inference algorithm, which is a stochastic inference algorithm that can be applied to practical circumstances. It outperformed existing methods in our experiments.

For future work, it would be interesting and essential to explore the relationship between the number of topics and the performance. Developing new inference algorithms based on other $\alpha$ -divergences or conducting experiments on various data sets and studying their results would also be important challenges.

Appendix: Convergence Proof

In this appendix, we will show that the proposed SDM algorithm does converge to the minimization of the divergence. This is not obvious since the term used in Eq. (51) is not a gradient of the divergence in Eq. (35) or Eq. (36).

We will prove the convergence of the stochastic approximation in Eq. (51) for the optimization problem in Eq. (35). The proof for Eq. (36) is similar and is therefore omitted. The overall argument follows, with some modifications, the appendix of Sato and Nakagawa (2015).

First, we restate the stochastic optimization problem in the framework of martingale convergence theory. The optimization problem is to find $b^{*}=\textrm{argmin}_{b}f(b)$ and the update formula is $b^{(t+1)}=b^{(t)}+\rho_{t}s^{(t)}$. We also let $\mathcal{F}^{(t)}$ be the history of the variable sequence, which is also called the filtration:

[TABLE]

Then we have the following martingale convergence theorem.

Theorem 5.1

Assume that the step size $\rho_{t}$, the function $f$, and the stochastic search direction $s^{(t)}$ satisfy the following four conditions.

1. The step size $\rho_{t}$ is a non-negative scalar and satisfies

[TABLE]

2. The function $f$ is continuously differentiable and there exists some constant $L$ such that

[TABLE]

3. There exists a positive constant $C$ such that

[TABLE]

4. There exist positive constants $A$ and $B$ such that

[TABLE]

Then, for the update equation $b^{(t+1)}=b^{(t)}+\rho_{t}s^{(t)}$, the following three statements hold with probability one.

1. The sequence $f(b^{(t)})$ converges.

2. $\lim_{t\to\infty}\nabla f(b^{(t)})=0$.

3. Every limit point of $b^{(t)}$ is a stationary point of $f$.

A detailed proof of the above theorem, which relies on the supermartingale convergence theorem, can be found in Bertsekas and Tsitsiklis (2000).
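As a quick check, suppose the first condition takes the usual Robbins-Monro form, $\sum_{t}\rho_{t}=\infty$ and $\sum_{t}\rho_{t}^{2}<\infty$ (an assumption about its exact statement). The step size $\rho_{t}=1/(1+t)^{\kappa}$ then satisfies it for any $\kappa\in(1/2,1]$, by comparison with the $p$-series:

```latex
\sum_{t=0}^{\infty} \rho_t
  = \sum_{t=0}^{\infty} \frac{1}{(1+t)^{\kappa}} = \infty
  \quad (\kappa \le 1),
\qquad
\sum_{t=0}^{\infty} \rho_t^{2}
  = \sum_{t=0}^{\infty} \frac{1}{(1+t)^{2\kappa}} < \infty
  \quad (2\kappa > 1).
```

In particular, the value $\kappa=0.51$ used for SDM in the experiments gives $2\kappa=1.02>1$, so it lies just inside this range.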

First, we reformulate the stochastic update formula. Given the objective function

[TABLE]

we take its derivative with respect to $b_{k,w_{i1}}^{\backslash i}$:

[TABLE]

Next, we define the stochastic direction

[TABLE]

where $\xi_{i^{\prime},k}=n_{\backslash i,w_{i1}}q(z_{i^{\prime}}=k|b_{i^{\prime}})-\mathbb{E}[n_{\backslash i,w_{i1}|k}]$ and $b_{i^{\prime}}$ is another biterm that contains the word $w_{i1}$ . Then we rewrite the stochastic direction as

[TABLE]

The step size we actually use satisfies the first condition of Theorem 5.1, and it is not hard to show that the objective function $D_{1}(b_{k,w_{i1}}^{\backslash i})$ satisfies the second condition. Therefore, what remains is to show that the third and fourth conditions are satisfied. The proof of the following lemma can be found in the paper that introduced stochastic divergence minimization for LDA (Sato and Nakagawa, 2015).

Lemma 1

If the stochastic noise term $\xi^{(t)}$ satisfies the following conditions, then the stochastic direction $s^{(t)}$ satisfies the third and fourth conditions of Theorem 5.1.

1. $\{\xi^{(t)}\}$ is a martingale difference sequence with respect to the filtration $\mathcal{F}^{(t)}$, which means that $\mathbb{E}[\xi^{(t)}|\mathcal{F}^{(t)}]=0,\,\forall t>0$.

2. $\xi^{(t)}$ has bounded variance. For example, it is square integrable with

[TABLE]

for some constant $C$.

Therefore, what remains is to show the following lemma.

Lemma 2

The noise term $\xi^{(t)}$ satisfies the two conditions listed in Lemma 1.

Proof

For the first condition, we show that

[TABLE]

Therefore the first condition is satisfied. Recall Eq. (59):

[TABLE]

We then define stochastic gradient as

[TABLE]

We confirm that

[TABLE]

The difference between the stochastic gradient and the real gradient is

[TABLE]

thus there exists a constant $C$ so that

[TABLE]

Thus

[TABLE]

We then introduce $D_{1}^{*}(b_{k,w_{i1}}^{\backslash i})\geq\tilde{D}_{1}(b_{k,w_{i1}}^{\backslash i})$ which is given by

[TABLE]

Therefore, there exists another constant $G$ such that

[TABLE]

for example,

[TABLE]

Therefore, we can say

[TABLE]

which satisfies the second condition. ∎

To conclude, the convergence of our stochastic approximation is proved.
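The martingale difference property of the noise term can be sanity-checked on a toy example. The sketch below assumes that $i^{\prime}$ is drawn uniformly from the $n$ biterms containing the word, and that the expected count $\mathbb{E}[n_{\backslash i,w|k}]$ equals the sum of the responsibilities $q(z_{i^{\prime}}=k|b_{i^{\prime}})$ over those biterms; both are assumptions made for illustration, and the function names are hypothetical.

```python
from fractions import Fraction


def noise_terms(q_k):
    """Return xi for every possible choice of the sampled biterm i'.

    q_k[j] is the responsibility q(z_j = k | b_j) of the j-th biterm that
    contains the word; xi_j = n * q_k[j] - sum(q_k).
    """
    n = len(q_k)
    expected_count = sum(q_k)  # assumed form of E[n_{w|k}]
    return [n * q - expected_count for q in q_k]


def mean_noise(q_k):
    """Expectation of xi when i' is sampled uniformly from the n biterms."""
    xi = noise_terms(q_k)
    return sum(xi) / len(xi)


# Exact arithmetic with fractions avoids floating-point round-off.
q = [Fraction(1, 10), Fraction(7, 10), Fraction(2, 5)]
print(mean_noise(q))  # prints 0
```

Since every responsibility lies in $[0,1]$, each $|\xi|$ is also bounded by $n$ in this toy setting, which is the kind of bound the second condition of Lemma 1 requires.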

Bibliography


Al Sumait L, Barbará D, Domeniconi C (2008) On-line LDA: Adaptive topic models for mining text streams with applications to topic detection and tracking. pp 3–12
Amari Si (1990) Differential-geometrical methods in statistics. Lecture notes in statistics, Springer-Verlag, Berlin, Heidelberg
Asuncion A, Welling M, Smyth P, Teh YW (2009) On smoothing and inference for topic models. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, AUAI Press, Arlington, Virginia, United States, UAI ’09, pp 27–34, URL http://dl.acm.org/citation.cfm?id=1795114.1795118
Awaya N, Kitazono J, Omori T, Ozawa S (2016) Stochastic collapsed variational Bayesian inference for biterm topic model. In: 2016 International Joint Conference on Neural Networks (IJCNN), pp 3364–3370
Beal MJ (2003) Variational algorithms for approximate Bayesian inference
Bertsekas DP, Tsitsiklis JN (2000) Gradient convergence in gradient methods with errors. SIAM Journal on Optimization 10(3):627–642
Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
Canini KR, Shi L, Griffiths TL (2009) Online inference of topics with latent Dirichlet allocation. In: Proceedings of AI Stats
