Stochastic Divergence Minimization for Biterm Topic Model
Zhenghang Cui, Issei Sato, Masashi Sugiyama

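Table 1: Comparison of update cost, memory, and purpose of the inference algorithms.
| Algorithm | Update Cost | Memory | Purpose |
|---|---|---|---|
| iBTM | | | Post. Approx. |
| oBTM | | | Post. Approx. |
| SCVB0-BTM | | | Post. Approx. |
| SDM-BTM | | | LOO Est. |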
Zhenghang Cui (1), Issei Sato (1, 2), Masashi Sugiyama (2, 1)
1 The University of Tokyo, Japan
2 RIKEN, Japan
Abstract
With the emergence and thriving development of social networks, a huge number of short texts have accumulated and need to be processed. Inferring latent topics of collected short texts is useful for understanding their hidden structure and predicting new content. Unlike conventional topic models such as latent Dirichlet allocation (LDA), a biterm topic model (BTM) was recently proposed for short texts to overcome the sparseness of document-level word co-occurrences by directly modeling the generation process of word pairs. Stochastic inference algorithms based on collapsed Gibbs sampling (CGS) and collapsed variational inference have been proposed for BTM. However, they either incur high computational cost or rely on very crude estimation. In this work, we develop a stochastic divergence minimization inference algorithm for BTM to estimate latent topics more accurately in a scalable way. Experiments demonstrate the superiority of our proposed algorithm compared with existing inference algorithms.
Keywords:
Short text, topic model, biterm, stochastic inference algorithm
Journal: Machine Learning
1 Introduction
As social network services have become dominant in people’s daily lives, a huge amount of short text data has accumulated. Other data found on traditional web pages, such as article titles and public forum comments, share the same property of short length. Exploring the inner structure of such data is an essential and interesting task for a wide range of applications, such as content-based classification or prediction of future documents that have not yet emerged. Because short document length makes document-level word co-occurrences sparse, conventional topic models such as probabilistic latent semantic indexing (pLSA) (Hofmann, 1999) and latent Dirichlet allocation (LDA) (Blei et al, 2003) fail to show favorable inference performance on data sets consisting of short texts. A biterm topic model (BTM) (Cheng et al, 2014) was proposed to alleviate this sparsity problem. Instead of each single word, BTM models the generation process of each unordered combination of two words, called a biterm. Each biterm is assumed to be assigned one topic. Compared to conventional topic models, this modification makes BTM less sensitive to the shortness of each document and reveals the relationship between words more stably. Experiments (Cheng et al, 2014) have shown that, by modeling word co-occurrences explicitly through biterms, BTM alleviates the problem caused by document-level word co-occurrence sparsity while retaining generality and flexibility.
To infer model parameters and estimate latent topics for BTM, a batch inference algorithm based on collapsed Gibbs sampling (CGS) was first proposed together with the model (Cheng et al, 2014) to approximate the true posterior distribution of the parameters. Based on this batch CGS algorithm, two online algorithms were proposed (Cheng et al, 2014) to scale to large data sets. One updates hyperparameters between time slices, inspired by the online LDA algorithm (AlSumait et al, 2008). The other resamples the topics of previously observed biterms a sufficient number of times after each new biterm is observed, inspired by an incremental Gibbs sampler for LDA (Canini et al, 2009). Separately, building on zero-order stochastic collapsed variational Bayesian inference (SCVB0) for LDA (Foulds et al, 2013), a similar SCVB0 algorithm for BTM was proposed for better estimation of latent topics (Awaya et al, 2016). However, these online algorithms either use memory inefficiently or rely on very crude estimation.
In this paper, we propose a stochastic divergence minimization (SDM) inference algorithm for BTM based on minimizing the $\alpha$-divergence to estimate latent topics more accurately. First, inspired by the work for LDA (Sato and Nakagawa, 2012), we reformulate collapsed variational Bayesian inference that uses only the zero-order Taylor series approximation (CVB0) as an optimization problem of $\alpha$-divergence minimization. Then, we apply a stochastic approximation method to this optimization problem to develop a stochastic inference algorithm.
For a general probabilistic model, CGS inference algorithms try to approximate the posterior distribution by sampling, while variational Bayesian (VB) inference algorithms try to find the closest distribution within a given function family (Beal, 2003). Closeness is usually measured by the KL divergence. VB transforms the original inference problem into an optimization problem, which can be solved by a simple gradient descent algorithm. Similarly to CGS, collapsed variational Bayesian (CVB) inference marginalizes out the parameters that are not of direct interest and infers only the latent variables. For example, CVB for BTM (Awaya et al, 2016) marginalizes out the model parameters, namely the vector of topic proportions and the matrix of per-topic word distributions, and computes only the posterior distribution of the latent variables indicating the topic assignments of biterms. CGS algorithms usually converge more slowly and are strongly influenced by the initial state of the parameters, owing to the nature of Markov chain Monte Carlo sampling. CVB, on the other hand, is a deterministic algorithm. Empirically, it converges faster and performs better (Asuncion et al, 2009).
Since exact evaluation of expectations in the CVB formula is intractable, the idea of using only the zero-order term of its Taylor series as a rough approximation is appealing. This results in a zero-order CVB inference algorithm (CVB0) proposed for BTM (Awaya et al, 2016). Based on CVB0, stochastic approximations are developed to scale up the algorithm for huge data sets (Awaya et al, 2016). However, the reason why zero-order approximation is used instead of higher order approximations is not clearly explained. Furthermore, although the SCVB0 for the BTM algorithm utilizes a scale coefficient to reduce the computational complexity of each iteration from to , where denotes the size of the vocabulary, the risk of arithmetic underflow in floating point calculations always exists when processing data sets of very large size. It also utilizes a very crude approximation for essential statistics at each iteration.
Contributions
Considering the issues discussed above, we propose a novel SDM inference algorithm for BTM. We have three main contributions listed as follows.
- We provide a novel formulation of SCVB0 inference for BTM from the perspective of $\alpha$-divergence minimization. This offers a new way to understand SCVB0 inference for BTM and is inspired by similar work for LDA (Sato and Nakagawa, 2012).
- We derive an SDM algorithm for BTM based on the $\alpha$-divergence minimization formulation of SCVB0. SDM for BTM is a one-pass algorithm: it processes each biterm only once and stops when all biterms have been processed. Compared to SCVB0 for BTM, SDM for BTM requires the same amount of memory and has the same computational complexity for processing a single biterm. However, SCVB0 does not preserve the sufficient statistics for the counts of each word and runs the risk of arithmetic underflow in floating point calculations. SDM for BTM has neither problem and provides a better approximation. Experiments show that SDM for BTM estimates latent topics more accurately and thus predicts documents that have not yet emerged with higher accuracy than existing methods.
- We analyze the convergence of our proposed method using martingale convergence theory.
The remainder of this paper is organized as follows. In Section 2, we introduce the related works on BTM, its existing inference algorithms, and the theoretical background for SDM. In Section 3, we introduce our proposed SDM algorithm. In Section 4, we conduct experiments to evaluate our proposed method against existing methods and discuss the result. In Section 5, we conclude this paper.
2 Related Works
In this section, we introduce the biterm topic model (BTM) (Cheng et al, 2014), followed by its batch and online inference algorithms. Background on the $\alpha$-divergence is presented at the end of this section.
2.1 BTM
Conventional topic models such as LDA usually fail to show satisfactory performance on short text data sets. BTM was proposed to alleviate this problem by modifying the word-generating part of the graphical model. Instead of modeling the generation of each word, BTM directly models the generation of biterms, which are unordered combinations of two words. For example, a document of $n$ words generates $n(n-1)/2$ combinations of two words. Compared to conventional topic models, this modification makes BTM less sensitive to the short length of each document, and biterms reveal the relationship between words more stably. Following the original paper (Cheng et al, 2014), the notation is listed as follows.
- A data set $B$ contains $N_B$ biterms, where each biterm is denoted by $b_i = (w_{i,1}, w_{i,2})$.
- The number of topics is denoted by $K$.
- The size of the vocabulary is denoted by $W$.
- A topic proportion vector is denoted by $\theta$. Its length is $K$ and all of its entries sum to $1$.
- A word distribution matrix is denoted by $\phi$. Its size is $K \times W$. Each row vector $\phi_k$ has length $W$ and sums to $1$.
- A topic indicator variable for biterm $b_i$ is denoted by $z_i$. It has a length of $K$ and all of its entries sum to $1$.
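The construction of biterms from a short document can be sketched in a few lines of Python. This is only an illustration of the definition above (unordered pairs of words within one document); the function name `extract_biterms` and the toy document are hypothetical and do not come from the paper.

```python
from itertools import combinations


def extract_biterms(tokens):
    """Return all unordered word pairs (biterms) of one short document."""
    # Sorting each pair makes (a, b) and (b, a) map to the same biterm.
    return [tuple(sorted(pair)) for pair in combinations(tokens, 2)]


# Toy document (hypothetical): three words give three biterms.
doc = ["visit", "apple", "store"]
biterms = extract_biterms(doc)
```

A document with a single word yields an empty list, which is why such documents carry no co-occurrence information for BTM.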
The generative process is described formally as follows.
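1. Draw $\theta \sim \mathrm{Dirichlet}(\alpha)$.
2. For each topic $k$:
   - (a) Draw $\phi_k \sim \mathrm{Dirichlet}(\beta)$.
3. For each biterm $b_i$:
   - (a) Draw $z_i \sim \mathrm{Multinomial}(\theta)$.
   - (b) Draw $w_{i,1}, w_{i,2} \sim \mathrm{Multinomial}(\phi_{z_i})$.

Here, Dirichlet($\cdot$) denotes a Dirichlet distribution and Multinomial($\cdot$) denotes a multinomial distribution, each with the parameter given in parentheses. The graphical model of BTM is shown in Fig. 1.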
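Following the generation process, we can express the likelihood of a data set $B$ conditioned on the model parameters $\theta$ and $\phi$ as
\[
P(B \mid \theta, \phi) = \prod_{i=1}^{N_B} \sum_{k=1}^{K} \theta_k \, \phi_{k, w_{i,1}} \, \phi_{k, w_{i,2}}, \tag{1}
\]
where the product runs over all biterms $b_i = (w_{i,1}, w_{i,2})$ and the sum runs over all topics $k$.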
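As a rough illustration of the generative process and the likelihood, the following NumPy sketch samples a synthetic biterm set and evaluates its log likelihood. It assumes symmetric scalar Dirichlet hyperparameters and the standard BTM mixture form in which each biterm has probability $\sum_k \theta_k \phi_{k,w_{i,1}} \phi_{k,w_{i,2}}$; the function names, sizes, and hyperparameter values are hypothetical choices made for this example.

```python
import numpy as np


def sample_biterms(num_biterms, num_topics, vocab_size, alpha, beta, rng):
    """Sample (theta, phi, z, biterms) from the BTM generative process."""
    theta = rng.dirichlet(np.full(num_topics, alpha))                # topic proportions
    phi = rng.dirichlet(np.full(vocab_size, beta), size=num_topics)  # K x W word distributions
    z = rng.choice(num_topics, size=num_biterms, p=theta)            # one topic per biterm
    # Both words of a biterm are drawn from the word distribution of its topic.
    biterms = np.array([rng.choice(vocab_size, size=2, p=phi[k]) for k in z])
    return theta, phi, z, biterms


def log_likelihood(biterms, theta, phi):
    """Sum over biterms of log sum_k theta_k * phi[k, w1] * phi[k, w2]."""
    per_topic = theta[:, None] * phi[:, biterms[:, 0]] * phi[:, biterms[:, 1]]  # K x N
    return float(np.sum(np.log(per_topic.sum(axis=0))))


rng = np.random.default_rng(0)
theta, phi, z, biterms = sample_biterms(
    num_biterms=1000, num_topics=3, vocab_size=20, alpha=1.0, beta=1.0, rng=rng
)
ll = log_likelihood(biterms, theta, phi)
```

Dividing `ll` by the number of biterms gives an average log likelihood per biterm, which is the kind of quantity used for held-out evaluation in Section 4.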
2.2 Batch Inference Algorithm
Here, we concisely introduce the batch inference algorithm, which estimates three sets of parameters: $z$, which indicates the topic assignments; $\theta$, which indicates the topic proportion; and $\phi$, which indicates the word distribution for each topic. Since it is intractable to compute the exact posterior distributions of these parameters, the CGS algorithm is used to approximate them (Cheng et al, 2014). Parameters $\theta$ and $\phi$ are first integrated out using conjugate priors; then the topic $z_i$ of each biterm $b_i$ is sampled from its posterior distribution conditioned on all of the other variables. After processing all biterms, $\theta$ and $\phi$ can be restored using $z$. However, this can be a computational burden when the given data set is large, which motivates the development of the stochastic inference algorithms discussed in Section 2.3 and Section 2.4. The following formula is used to sample $z_i$ for each biterm $b_i$:
[TABLE]
Let $z_{-i}$ be the whole topic assignment vector without considering $z_i$, $n_{-i,k}$ be the number of biterms assigned to topic $k$ without counting $b_i$, and $n_{-i,w|k}$ be the number of times that word $w$ is assigned to topic $k$ without counting $b_i$. The dot in $n_{-i,\cdot|k}$ means taking the sum over all words. After a sufficient number of iterations over the whole data set, we can restore $\theta$ and $\phi$ using the following formulas:
\[
\theta_k = \frac{n_k + \alpha}{N_B + K\alpha}, \tag{3}
\]
\[
\phi_{k,w} = \frac{n_{w|k} + \beta}{n_{\cdot|k} + W\beta}, \tag{4}
\]
where $n_k$ is the number of biterms assigned to topic $k$ and $n_{w|k}$ is the number of times that word $w$ is assigned to topic $k$. The dot in $n_{\cdot|k}$ means taking the sum over all words.
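The batch procedure described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the sampling formula is not legible in this copy, so the conditional used here is the commonly cited BTM form with a squared denominator, which is an assumption, and the exact normalizer in the paper may differ. The function name, toy data, and hyperparameter values are hypothetical.

```python
import numpy as np


def btm_collapsed_gibbs(biterms, num_topics, vocab_size, alpha, beta, num_iters, rng):
    """Batch collapsed Gibbs sampling for BTM (illustrative sketch)."""
    biterms = np.asarray(biterms)
    num_biterms = len(biterms)
    n_k = np.zeros(num_topics)                 # biterms assigned to each topic
    n_wk = np.zeros((num_topics, vocab_size))  # word occurrences in each topic
    z = rng.integers(num_topics, size=num_biterms)
    for (w1, w2), k in zip(biterms, z):
        n_k[k] += 1
        n_wk[k, w1] += 1
        n_wk[k, w2] += 1

    for _ in range(num_iters):
        for i, (w1, w2) in enumerate(biterms):
            k = z[i]
            # Remove biterm i so that the counts become the "-i" statistics.
            n_k[k] -= 1
            n_wk[k, w1] -= 1
            n_wk[k, w2] -= 1
            # Every biterm contributes two words, so the word total is 2 * n_k.
            n_dot = 2.0 * n_k
            # ASSUMED conditional: the denominator in the paper may differ.
            weights = (n_k + alpha) * (n_wk[:, w1] + beta) * (n_wk[:, w2] + beta)
            weights /= (n_dot + vocab_size * beta) ** 2
            k = rng.choice(num_topics, p=weights / weights.sum())
            z[i] = k
            n_k[k] += 1
            n_wk[k, w1] += 1
            n_wk[k, w2] += 1

    # Restore theta and phi from the final counts.
    theta = (n_k + alpha) / (num_biterms + num_topics * alpha)
    phi = (n_wk + beta) / (n_wk.sum(axis=1, keepdims=True) + vocab_size * beta)
    return z, theta, phi


rng = np.random.default_rng(0)
toy_biterms = [(0, 1), (0, 1), (1, 0), (2, 3), (2, 3), (3, 2)]
z, theta, phi = btm_collapsed_gibbs(
    toy_biterms, num_topics=2, vocab_size=4, alpha=1.0, beta=0.1, num_iters=20, rng=rng
)
```

Each sweep visits every biterm, which is the cost that motivates the online and stochastic variants discussed next.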
2.3 Online BTM Algorithm
In recent real-world inference problems, the amount of data to analyze is usually very large and keeps increasing. To deal with such large data, it is useful to develop algorithms that can handle data in streaming form. In the original paper (Cheng et al, 2014), two algorithms were introduced to deal with very large data sets. The online BTM algorithm is introduced here and the incremental BTM algorithm is introduced in Section 2.4.
The online BTM algorithm is inspired by a similar algorithm proposed for LDA (AlSumait et al, 2008). The data set is assumed to be divided into multiple time slices, e.g., hourly, daily, or weekly. While a single time slice is processed, the hyperparameters of the priors on the topic proportion and the word distributions are updated using statistics of the data in this time slice. After a sufficient number of iterations, the topic proportion and the word distributions can be restored to reflect the influence of this time slice.
The notations are described as follows. A biterm set of time is denoted by . The number of biterms assigned to topic within is denoted by . The number of times word is assigned to topic within is denoted by . Hyperparameters for are denoted by vector and hyperparameters for are denoted by matrix , where is a vector consisting of . The conditional distribution for sampling each topic is given by
[TABLE]
After each time slice has been processed, the hyperparameters can be updated as
[TABLE]
[TABLE]
where the decay weight is denoted by . It controls the dependency to data in past time slices. The details of the procedure are described in Alg. 1.
2.4 Incremental BTM Algorithm
Although the online BTM algorithm can be adapted to sequential data, updating parameters immediately after a biterm arrives may be essential in some situations. The incremental BTM algorithm was proposed for this purpose. It can update parameters after the arrival of each single biterm.
The incremental BTM algorithm is inspired by the incremental Gibbs sampler (Canini et al, 2009). Specifically, after a new biterm arrives and its topic has been sampled, a biterm sequence called a rejuvenation sequence is constructed on the fly and the topics of all biterms belonging to this sequence are resampled. Evidently, the length and the choice of the rejuvenation sequence strongly influence performance. For convenience, the sequence length is treated as a hyperparameter and the sequence is generated from the uniform distribution.
The details of the procedure are described in Alg. 2.
2.5 SCVB0 Algorithm for BTM
The batch algorithm, CVB0 for BTM, is introduced first, followed by its stochastic formulation, SCVB0 for BTM.
CVB0 for BTM (Awaya et al, 2016) is inspired by CVB0 for LDA (Asuncion et al, 2009). Similarly to CGS, the global parameters $\theta$ and $\phi$ are first marginalized out and inference is performed only for the latent variable $z$. A zero-order Taylor series approximation is used because some expectations are intractable to evaluate. The update formula for the variational parameter can be derived as
[TABLE]
where , , denotes the set of biterms containing word and means counting without considering .
SCVB0 for BTM is based on the idea of ignoring the subtraction of the current biterm and updating the statistics in a stochastic way. Storing all variational parameters is not necessary, and a very crude estimate of and when a biterm is observed can be expressed by
[TABLE]
[TABLE]
Then, and can be updated using the following formulas:
[TABLE]
[TABLE]
where denotes the step size.
To reduce the computational complexity of each update from to , the following technique is used to represent the value of . A scaling coefficient and a dummy matrix are in fact stored, where is satisfied. Every time when is updated, one just needs to multiply by and manually computes the values of and . This manipulation significantly reduces the computational complexity, but bears the risk of ’s underflow, because is multiplied repeatedly during the algorithm.
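The scaling technique and its underflow risk can be illustrated with a small Python sketch. It represents a statistics vector as the product of a scalar and a stored array, so that a decay-and-add update touches a single entry instead of the whole vector. The class `ScaledVector`, the helper functions, and the constant step size are hypothetical simplifications for illustration; the actual update in SCVB0 for BTM involves per-topic quantities that are not reproduced here.

```python
import sys

import numpy as np


class ScaledVector:
    """Represents n = scale * base so that n <- (1 - rho) * n costs O(1)."""

    def __init__(self, size):
        self.scale = 1.0
        self.base = np.zeros(size)

    def decay_and_add(self, rho, index, value):
        # Implements n <- (1 - rho) * n + rho * value * e_index.
        self.scale *= 1.0 - rho
        self.base[index] += rho * value / self.scale

    def get(self, index):
        return self.scale * self.base[index]


def naive_update(n, rho, index, value):
    """Same update on a dense vector; touches every entry."""
    n = (1.0 - rho) * n
    n[index] += rho * value
    return n


def scale_after(num_steps, rho):
    """Value of the scaling coefficient after repeated multiplication."""
    scale = 1.0
    for _ in range(num_steps):
        scale *= 1.0 - rho
    return scale


rng = np.random.default_rng(0)
naive = np.zeros(5)
scaled = ScaledVector(5)
for _ in range(50):
    w = int(rng.integers(5))
    naive = naive_update(naive, 0.1, w, 1.0)
    scaled.decay_and_add(0.1, w, 1.0)
recovered = np.array([scaled.get(w) for w in range(5)])

# With a constant step size of 0.01 (an assumption for this demo), the
# coefficient leaves the normal double-precision range after enough steps.
tiny_scale = scale_after(100_000, 0.01)
```

In the sketch, `recovered` matches `naive` after a few updates, while `tiny_scale` falls below `sys.float_info.min`, at which point dividing by the coefficient is no longer numerically reliable.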
After processing all of the biterms, global parameters can be restored using the following formulas:
[TABLE]
[TABLE]
The details of the procedure are described in Alg. 3.
2.6 $\alpha$-divergence
Here we briefly introduce the concepts of the $\alpha$-divergence and local divergence projection inference. More details can be found in Amari (1990) and Minka (2005).
Definition
The $\alpha$-divergence can be viewed as a generalization of the KL divergence. We state its definition for two distributions $p$ and $q$. The $\alpha$-divergence from $p$ to $q$, indexed by $\alpha$, is defined as
[TABLE]
Notice that $p$ and $q$ need not be normalized before calculating the $\alpha$-divergence. Some useful special cases are:
[TABLE]
[TABLE]
[TABLE]
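Because the defining equations are not legible in this copy, the following sketch uses the form of the $\alpha$-divergence given by Minka (2005) for discrete, possibly unnormalized distributions, $D_\alpha(p \| q) = \frac{1}{\alpha(1-\alpha)} \sum_x \left[ \alpha p(x) + (1-\alpha) q(x) - p(x)^{\alpha} q(x)^{1-\alpha} \right]$. Treating this as the definition intended here is an assumption. The code checks numerically that the divergence approaches the two KL divergences as $\alpha$ tends to 1 and 0, and that $\alpha = 0.5$ gives a Hellinger-type distance. Function names and test vectors are hypothetical.

```python
import numpy as np


def alpha_divergence(p, q, alpha):
    """Minka-style alpha-divergence for positive, possibly unnormalized vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    integrand = alpha * p + (1.0 - alpha) * q - p**alpha * q ** (1.0 - alpha)
    return float(integrand.sum() / (alpha * (1.0 - alpha)))


def generalized_kl(p, q):
    """KL divergence extended to unnormalized vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return float(np.sum(p * np.log(p / q) + q - p))


p = np.array([0.2, 0.5, 0.3])
q = np.array([0.3, 0.3, 0.4])
near_one = alpha_divergence(p, q, 0.999)   # close to KL(p || q)
near_zero = alpha_divergence(p, q, 0.001)  # close to KL(q || p)
hellinger_like = alpha_divergence(p, q, 0.5)
```

The extra `q - p` term in `generalized_kl` vanishes when both vectors are normalized, which is why the divergence can be applied to unnormalized factors in the local projections that follow.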
Local $\alpha$-divergence projection
Suppose that the distribution we approximate can be fully factorized. That is, , where denotes the -th element of the vector . , where denotes the transpose. Depending on , it is intractable to naively compute the $\alpha$-divergence. To avoid this problem, we focus on each single and then optimize each by
[TABLE]
where represents all but -th entry of and . Its update formula can be obtained by taking the derivative of the $\alpha$-divergence and setting it to zero:
[TABLE]
Since the expectation over is computationally intractable, we approximate it by first splitting the numerator as
[TABLE]
We then substitute with to obtain the following approximation:
[TABLE]
This is the method we will use in the derivation of a stochastic divergence minimization algorithm to approximate the $\alpha$-divergence.
3 Proposed Method
In this section, we propose a novel SDM inference algorithm for BTM. We will first show the derivation of SDM. Then we will show its relation to the leave-one-out likelihood (LOO).
3.1 Derivation of Divergence Minimization
We assume the independence between latent topics of the given biterms as
[TABLE]
We then estimate this distribution by $\alpha$-divergence minimization:
[TABLE]
where denotes the optimized distribution.
Since it is intractable to compute this minimization, we consider the following local divergence minimization. First, note that
[TABLE]
where the notations are the same as those in Eq. (2) for CGS.
We then reparameterize as follows:
[TABLE]
[TABLE]
[TABLE]
[TABLE]
Notice that , and here are not counts. They are just parameters of the function defined above.
We also define
[TABLE]
[TABLE]
[TABLE]
[TABLE]
Recalling that the $\alpha$-divergence does not need the distributions to be normalized, we can define the following local projections:
[TABLE]
[TABLE]
[TABLE]
[TABLE]
where
[TABLE]
[TABLE]
[TABLE]
[TABLE]
Taking the derivative of Eq. (34) with respect to and equating it to zero yields
[TABLE]
With , we can obtain
[TABLE]
Similarly, we can derive the solutions to the other optimization problems listed above as follows:
[TABLE]
[TABLE]
If we use $\alpha$-divergence projection with for and , while using it with for , we can obtain the update formula for as
[TABLE]
which can also be obtained by SCVB0.
3.2 Relation to LOO Likelihood
Here, we investigate the relationship between divergence minimization and the LOO likelihood.
For the LOO prediction of a new biterm , its probability is given by
[TABLE]
On the other hand, using Eq. (46), we get
[TABLE]
This full likelihood is similar to the above LOO likelihood and can be regarded as an approximation to it. Considering the close relation between the LOO likelihood and the full likelihood, lowering the full likelihood indicated by can result in a lower likelihood of the data set. The close correlation between the LOO perplexity and test perplexity of LDA has been shown with detailed experimental results in Sato and Nakagawa (2015).
3.3 Derivation of SDM
For the three terms we defined in $\alpha$-divergence minimization, we have . Therefore, can be restored from and we need only to compute the values of . can be calculated as , so we can update it by . For this reason, we will focus on the stochastic approximation for the term .
Recall that . We can rewrite it in the form of fixed point iteration with step size :
[TABLE]
We can then replace the term with approximation , which is defined by
[TABLE]
where is a random sample with also containing the word . We can then substitute it into the update formula:
[TABLE]
This update formula is in effect an update indexed by vocabulary words. Furthermore, we process all the biterms in a one-pass fashion, which means that each biterm is processed exactly once, one after another, until the end. For this reason, we take the perspective of word-type updates and reformulate the update as
[TABLE]
where represents how many times the word has been updated so far and . In the context of stochastic approximation, we can simply make .
The detailed procedure is described in Alg. 4.
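The role of a per-word update counter in a fixed-point iteration with a decaying step size can be illustrated with the following sketch. It is not the SDM update itself, whose formulas are not legible in this copy. It only shows the general stochastic approximation pattern, assuming a step size equal to the reciprocal of the number of updates a word has received so far, in which case the iteration reduces to a running mean of the noisy values observed for that word. The class name and the toy values are hypothetical.

```python
from collections import defaultdict


class PerWordAverager:
    """Fixed-point iteration x <- (1 - rho) * x + rho * y with rho = 1 / t per word."""

    def __init__(self):
        self.num_updates = defaultdict(int)  # how many times each word was updated
        self.estimate = defaultdict(float)   # current estimate for each word

    def update(self, word, noisy_value):
        self.num_updates[word] += 1
        rho = 1.0 / self.num_updates[word]   # assumed step-size schedule
        self.estimate[word] = (1.0 - rho) * self.estimate[word] + rho * noisy_value


averager = PerWordAverager()
for value in [1.0, 2.0, 3.0, 4.0]:
    averager.update("cat", value)
averager.update("dog", 10.0)
```

Frequent words therefore take small steps while rare words still move quickly, because each word keeps its own counter instead of sharing a global iteration index.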
4 Experiments
We evaluate the effect of our proposed algorithm compared with existing inference algorithms.
4.1 Experimental Settings
We used the data set called Tweets2011 (http://trec.nist.gov/data/tweets) to evaluate the algorithms. Tweets2011 is a standard collection of tweets published between January 23rd and February 8th, 2011.
The raw data of Tweets2011 is very noisy and contains tweets in multiple languages. Many of these languages, such as Japanese, are difficult to analyze morphologically. Therefore, in the preprocessing stage, we applied a filter to keep only English tweets. Then, we performed essential preprocessing tasks such as stop word removal and punctuation removal. Finally, we removed documents consisting of only a single word, since they cannot form biterms and have no co-occurrence between different words.
After preprocessing, as we can see from the document length distribution shown in Fig. 2, most of the documents have length less than 10. There are around million documents and the average document length is words.
In most real world situations, the size of a given data set is too large to run a batch algorithm. Therefore, to better simulate practical circumstances, we considered the four algorithms, namely, the proposed SDM algorithm, the SCVB0 algorithm for BTM, the online BTM algorithm, and the incremental BTM algorithm, that can scale up to large data sets.
In order to show the convergence speed of each algorithm, and to compare all of the above algorithms at the same scale, we let the programs output the result together with the proportion of processed biterms from the beginning of each experiment. All of the experiments are conducted on an Ubuntu server with 2.9 GHz Intel Xeon E5-2667 CPU and 64 GB memory.
After the preprocessing, we obtained million short documents. We then formulated around million biterms from these short documents. We shuffled all of the biterms and divided them into two data sets: the training set and the test set. Their sizes are set to be around . It turns out that the training set contains about million biterms, and the test set contains about million biterms.
For the three existing methods, hyperparameters are set in the same way as in the original paper (Cheng et al, 2014). To ensure that both CGS algorithms finish in a realistic time, we set the length of the rejuvenation sequence to for the incremental BTM algorithm and the number of inner iterations to for the online BTM algorithm. For all of the algorithms, we set to be and to be . As step size hyperparameters, and are used for SCVB0, while is used for SDM.
Cheng et al (2014) chose to evaluate the coherence of observed topics to measure the performance of each algorithm, which requires us to use external data from sources such as Wikipedia for evaluating the pointwise mutual information. This result can be dependent on the size and the quality of the external data. Therefore, we decided to use the strategy called predictive sample re-use (Geisser, 1975) to evaluate the efficiency of each algorithm. This evaluation strategy simply measures the log likelihood sum of the test data set based on the model parameters calculated by each algorithm.
4.2 Evaluation
We demonstrate and discuss the performance of the four inference algorithms. Each set of experiments is repeated 10 times.
In Fig. 3, the horizontal axis represents experimental settings with different numbers of topics. In Fig. 4, the horizontal axis represents the percentage of documents that have been processed. In both figures, the vertical axis represents the mean of the average log likelihood of the test set, where higher is better. Standard deviations over 10 runs are shown as error bars. SDM-BTM denotes the proposed SDM algorithm for BTM, SCVB0-BTM the SCVB0 algorithm for BTM, iBTM the incremental BTM algorithm, and oBTM the online BTM algorithm.
The comparison of the update cost, memory and purpose between different algorithms is shown in Tab. 1, where denotes the length of the rejuvenation sequence and denotes the size of a biterm mini-batch.
As shown in Fig. 3 and Fig. 4, compared to online algorithms based on CGS, SDM-BTM shows a higher convergence speed and a better convergence result. This is because SDM-BTM is a deterministic algorithm while CGS algorithms are based on sampling. Compared to SCVB0-BTM, SDM-BTM performs better on a real world data set because SDM-BTM preserves the sufficient statistics correctly and is not exposed to the risk of arithmetic underflow.
5 Conclusions
In this paper, we first reviewed the BTM and its existing inference algorithms. We then reconstructed CVB0 inference for BTM and proposed a novel SDM inference algorithm, which is a stochastic inference algorithm that can be applied to practical circumstances. It outperformed existing methods in our experiments.
For future work, it would be interesting and essential to explore the relationship between the number of topics and the performance. Developing new inference algorithms based on other $\alpha$-divergences or conducting experiments on various data sets and studying their results would also be important challenges.
Appendix: Convergence Proof
In this appendix, we will show that the proposed SDM algorithm does converge to the minimization of the divergence. This is not obvious since the term used in Eq. (51) is not a gradient of the divergence in Eq. (35) or Eq. (36).
We will prove the convergence of the stochastic approximation by Eq. (51) for the optimization problem in Eq. (35). The proof for Eq. (36) is similar and is therefore omitted. The overall argument follows, with some differences, the appendix of Sato and Nakagawa (2015).
First, we repeat the definition of the stochastic optimization problem based on Martingale convergence theory. The optimization problem is to find and the update formula is . We also let to be the history of the variable sequence, which is also called filtration :
[TABLE]
Then we have the following martingale convergence theorem.
Theorem 5.1
Assume that step size , function and stochastic search direction satisfy the following four conditions.
1. Step size is a non-negative scalar and satisfies
[TABLE]
2. Function is continuously differentiable and there exists some constant such that
[TABLE]
3. There exists a positive constant such that
[TABLE]
4. There exist positive constants and such that
[TABLE]
Then the following three statements hold for the update equation with probability one.
1. The sequence converges.
2. .
3. Every limit point of is a stationary point of .
A detailed proof of the above theorem, which builds on the supermartingale convergence theorem, can be found in Bertsekas and Tsitsiklis (2000).
First, we reformulate the stochastic update formula. Given the objective function
[TABLE]
We take its derivative with respect to :
[TABLE]
Next, we define the stochastic direction
[TABLE]
where and is another biterm that contains the word . Then we rewrite the stochastic direction as
[TABLE]
The step size we actually use satisfies the first condition of Theorem 5.1, and it is not hard to show that the objective function of the minimization satisfies the second condition. Therefore, what remains is to show that the third and fourth conditions are satisfied. The proof of the following lemma can be found in the paper that introduced stochastic divergence minimization for LDA (Sato and Nakagawa, 2015).
Lemma 1
If the stochastic noise term satisfies the following conditions, then the stochastic direction satisfies the third and fourth conditions of Theorem 5.1.
1. is a martingale difference sequence with respect to filtration , which means that .
2. has bounded variance. For example, it is square integrable with
[TABLE]
for some constant .
Therefore, what remains is to show the following lemma.
Lemma 2
The noise term satisfies the two conditions listed in Lemma 1.
Proof
For the first condition, we show that
[TABLE]
Therefore the first condition is satisfied. Recall Eq. (59):
[TABLE]
We then define stochastic gradient as
[TABLE]
We confirm that
[TABLE]
The difference between the stochastic gradient and the real gradient is
[TABLE]
thus there exists a constant so that
[TABLE]
Thus
[TABLE]
We then introduce which is given by
[TABLE]
Therefore, there exists another constant such that
[TABLE]
for example,
[TABLE]
Therefore, we can say
[TABLE]
which satisfies the second condition. ∎
To conclude, the convergence of our stochastic approximation is proved.
References
- AlSumait L, Barbará D, Domeniconi C (2008) On-line LDA: Adaptive topic models for mining text streams with applications to topic detection and tracking. pp 3–12
- Amari Si (1990) Differential-geometrical methods in statistics. Lecture Notes in Statistics, Springer-Verlag, Berlin, Heidelberg
- Asuncion A, Welling M, Smyth P, Teh YW (2009) On smoothing and inference for topic models. In: Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence, AUAI Press, Arlington, Virginia, United States, UAI ’09, pp 27–34, URL http://dl.acm.org/citation.cfm?id=1795114.1795118
- Awaya N, Kitazono J, Omori T, Ozawa S (2016) Stochastic collapsed variational Bayesian inference for biterm topic model. In: 2016 International Joint Conference on Neural Networks (IJCNN), pp 3364–3370
- Beal MJ (2003) Variational algorithms for approximate Bayesian inference
- Bertsekas DP, Tsitsiklis JN (2000) Gradient convergence in gradient methods with errors. SIAM Journal on Optimization 10(3):627–642
- Blei DM, Ng AY, Jordan MI (2003) Latent Dirichlet allocation. J Mach Learn Res 3:993–1022
- Canini KR, Shi L, Griffiths TL (2009) Online inference of topics with latent Dirichlet allocation. In: Proceedings of AI Stats
