TL;DR
This paper introduces a doubly sparse data-parallel sampler for hierarchical Dirichlet process (HDP) topic models, enabling efficient large-scale training on big text corpora using parallel computing.
Contribution
It presents a novel sparse parallel sampling method for HDP topic models that leverages natural language sparsity to improve scalability and efficiency.
Findings
Successfully trained on 8 million documents in under four days
Achieved efficient parallelization by exploiting data sparsity
Demonstrated scalability on large-scale text data
Abstract
To scale non-parametric extensions of probabilistic topic models such as Latent Dirichlet allocation to larger data sets, practitioners rely increasingly on parallel and distributed systems. In this work, we study data-parallel training for the hierarchical Dirichlet process (HDP) topic model. Based upon a representation of certain conditional distributions within an HDP, we propose a doubly sparse data-parallel sampler for the HDP topic model. This sampler utilizes all available sources of sparsity found in natural language - an important way to make computation efficient. We benchmark our method on a well-known corpus (PubMed) with 8m documents and 768m tokens, using a single multi-core machine in under four days.
| Symbol | Description | Symbol | Description |
|---|---|---|---|
| $V$ | Vocabulary size | $\mathbf{\Psi}$ | Global distribution over topics |
| $D$ | Total number of documents | $\mathbf{\Theta}$ | Document-topic probabilities |
| $N$ | Total number of tokens | $\boldsymbol{\theta}_d$ | Topic probabilities for document $d$ |
| $v(i)$ | Word type for token $i$ | $\mathbf{m}$ | Document-topic sufficient statistic |
| $d(i)$ | Document for token $i$ | $\mathbf{\Phi}$ | Topic-word probabilities |
| $w_{i,d}$ | Token $i$ in document $d$ | $\boldsymbol{\phi}_k$ | Word probabilities for topic $k$ |
| $b_{i,d}$ | Global topic draw indicator for $z_{i,d}$ | $\mathbf{n}$ | Topic-word sufficient statistic |
| $z_{i,d}$ | Topic indicator for token $i$ in $d$ | $\mathbf{l}$ | Global topic latent sufficient statistic |
| $K^*$ | Index for implicitly-represented topics | $\alpha, \beta, \gamma$ | Prior concentration for $\mathbf{\Theta}$, $\mathbf{\Phi}$, $\mathbf{\Psi}$ |
| Corpus | Vocabulary size | Documents | Tokens | Iterations | Threads | Runtime |
|---|---|---|---|---|---|---|
| AP | 7 074 | 2 206 | 393 567 | 100 000 | 8 | 3.8 hours |
| CGCBIB | 6 079 | 5 940 | 570 370 | 100 000 | 12 | 2.7 hours |
| NeurIPS | 12 419 | 1 499 | 1 894 051 | 255 500 | 8 | 24 hours |
| PubMed | 89 987 | 8 199 999 | 768 434 972 | 25 000 | 20 | 82.4 hours |
Sparse Parallel Training of Hierarchical Dirichlet Process Topic Models
Alexander Terenin
Imperial College London
Måns Magnusson
Uppsala University and Aalto University
Leif Jonsson
Ericsson AB and Linköping University
1 Introduction
Topic models are a widely-used class of methods that allow practitioners to identify latent semantic themes in large bodies of text in an unsupervised manner. They are particularly attractive in areas such as history (Yang et al., 2011; Wang et al., 2012), sociology (DiMaggio et al., 2013), and political science (Roberts et al., 2014), where a desire for careful control of structure and prior information incorporated into the model motivates one to adopt a Bayesian approach to learning. In these areas, large corpora such as newspaper archives are becoming increasingly available (Ehrmann et al., 2020), and models such as latent Dirichlet allocation (LDA) (Blei et al., 2003) and its nonparametric extensions (Teh et al., 2006; Teh, 2006; Hu and Boyd-Graber, 2012; Paisley et al., 2015) are widely used by practitioners. Moreover, these models are emerging as a component of data-efficient language models (Guo et al., 2020). Training topic models efficiently entails two requirements.
1. Expose sufficient parallelism that can be taken advantage of by the hardware.
2. Utilize sparsity found in natural language to control memory requirements and computational complexity.
In this work, we focus on the hierarchical Dirichlet process (HDP) topic model of Teh et al. (2006), which we review in Section 2. This model is a simple non-trivial extension of LDA to the nonparametric setting. This parallel implementation provides a blueprint for designing massively parallel training algorithms in more complicated settings, such as nonparametric dynamic topic models (Ahmed and Xing, 2010) and tree-based extensions (Hu and Boyd-Graber, 2012).
Parallel approaches to training HDPs have been previously introduced by a number of authors, including Newman et al. (2009), Wang et al. (2011), Williamson et al. (2013), Chang and Fisher (2014) and Ge et al. (2015). These techniques suit various settings: some are designed to explicitly incorporate sparsity present in natural language and other discrete spaces, while others are intended for HDP-based continuous mixture models. Gal and Ghahramani (2014) have pointed out that some methods can suffer from load-balancing issues, which limit their parallelism and scalability. To our knowledge, the largest benchmark of parallel HDP training is that of Chang and Fisher (2014) on the 100m-token NYTimes corpus. Throughout this work, we focus on Markov chain Monte Carlo (MCMC) methods: empirically, their scalability is comparable to variational methods (Magnusson et al., 2018; Hoffman and Ma, 2019), and, subject to convergence, they yield the correct posterior.
Our contributions are as follows. We propose an augmented representation of the HDP for which the topic indicators can be sampled in parallel over documents. We prove that, under this representation, the global topic distribution $\mathbf{\Psi}$ is conditionally conjugate given an auxiliary parameter $\mathbf{l}$. We develop fast sampling schemes for $\mathbf{\Psi}$ and $\mathbf{l}$, and propose a training algorithm with a per-iteration complexity that depends on the minimum of two sparsity terms, so it takes advantage of both document-topic and topic-word sparsity simultaneously.
2 Partially collapsed Gibbs sampling for hierarchical Dirichlet processes
The hierarchical Dirichlet process topic model (Teh et al., 2006) begins with a global distribution $\mathbf{\Psi}$ over topics. Documents are assumed exchangeable: for each document $d$, the associated topic distribution $\boldsymbol{\theta}_d$ follows a Dirichlet process centered at $\mathbf{\Psi}$. Each topic $k$ is associated with a distribution of tokens $\boldsymbol{\phi}_k$. Within each document, tokens are assumed exchangeable (bag of words) and assigned to topic indicators $z_{i,d}$. For given data, we observe the tokens $w_{i,d}$.
We thus arrive at the GEM representation of a HDP, given by equation (19) of Teh et al. (2006) as
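$$
\begin{aligned}
\mathbf{\Psi} &\sim \operatorname{GEM}(\gamma) && (1) \\
\boldsymbol{\theta}_d \mid \mathbf{\Psi} &\sim \operatorname{DP}(\alpha, \mathbf{\Psi}) && (2) \\
\boldsymbol{\phi}_k &\sim \operatorname{Dir}(\beta) && (3) \\
z_{i,d} \mid \boldsymbol{\theta}_d &\sim \operatorname{Discrete}(\boldsymbol{\theta}_d) && (4) \\
w_{i,d} \mid z_{i,d}, \mathbf{\Phi} &\sim \operatorname{Discrete}(\boldsymbol{\phi}_{z_{i,d}}) && (5)
\end{aligned}
$$
where $\alpha, \beta, \gamma$ are prior hyperparameters.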
2.1 Intuition and augmented representation
At a high level, our strategy for constructing a scalable sampler is as follows. Conditional on $\mathbf{\Psi}$, the likelihood in equations (1)–(5) is the same as that of LDA. Using this observation, the Gibbs step for $\mathbf{z}$, which is the largest component of the model, can be handled efficiently by leveraging insights on sparse parallel sampling from the well-studied LDA literature (Yao et al., 2009; Li et al., 2014; Magnusson et al., 2018; Terenin et al., 2019). For this strategy to succeed, we need to ensure that all Gibbs steps involved in the HDP under this representation are analytically tractable and can be computed efficiently. For this, the representation needs to be modified.
To begin, we integrate each $\boldsymbol{\theta}_d$ out of the model, which by conjugacy (Blackwell and MacQueen, 1973) yields a Pólya sequence for each $\mathbf{z}_d$. By definition, given in Appendix A, this sequence is a mixture distribution with respect to a set of Bernoulli random variables $b_{i,d}$, each representing whether $z_{i,d}$ was drawn from $\mathbf{\Psi}$ or from a repeated draw in the Pólya urn. Thus, the HDP can be written
[TABLE]
where is a Pólya sequence, defined in Appendix A. This representation defines a posterior distribution over for the HDP. To derive a Gibbs sampler, we calculate its full conditionals.
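The augmented representation can be illustrated with a small simulation. The following is a minimal sketch, not the authors' implementation: it draws one document's topic indicators from a Pólya urn whose base distribution is a finite probability vector, and records the Bernoulli indicators alongside. The function name `sample_polya_sequence` and the use of a finite `psi` are assumptions made for illustration.

```python
import numpy as np

def sample_polya_sequence(psi, alpha, n, rng):
    """Draw n topic indicators from a Polya urn with base distribution psi.

    Returns (z, b), where b[i] = 1 if z[i] was a fresh draw from psi and
    b[i] = 0 if it repeats an earlier draw in the same sequence.
    """
    psi = np.asarray(psi, dtype=float)
    z = np.empty(n, dtype=int)
    b = np.empty(n, dtype=int)
    for i in range(n):
        # With i earlier draws, a fresh draw has probability alpha / (alpha + i).
        if rng.random() < alpha / (alpha + i):
            z[i] = rng.choice(len(psi), p=psi)
            b[i] = 1
        else:
            # Otherwise copy a uniformly chosen earlier draw from the urn.
            z[i] = z[rng.integers(i)]
            b[i] = 0
    return z, b

rng = np.random.default_rng(0)
z, b = sample_polya_sequence([0.5, 0.3, 0.2], alpha=1.0, n=50, rng=rng)
# Count, per topic, the indicators that were fresh draws from psi.
fresh_counts = np.bincount(z[b == 1], minlength=3)
```

The first indicator is always a fresh draw, since the urn is empty at that point. Counting only the fresh draws per topic, as in `fresh_counts`, gives the kind of statistic on which the global topic distribution depends.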
2.2 Full conditionals for $\mathbf{\Phi}$, $\mathbf{z}$, and $\mathbf{b}$
The full conditionals for $\mathbf{\Phi}$ and $\mathbf{z}$, with $\mathbf{\Theta}$ marginalized out, are essentially those in partially collapsed LDA (Magnusson et al., 2018; Terenin et al., 2019). They are
[TABLE]
where is the word type for word token , and
[TABLE]
where denotes the document-topic sufficient statistic with index removed, and is the topic-word sufficient statistic. Note the number of possible topics and full conditionals here is countably infinite. The full conditional for each is
[TABLE]
The derivation, based on a direct application of Bayes’ Rule with respect to the probability mass function of the Pólya sequence, is in Appendix A.
2.3 The full conditional for $\mathbf{\Psi}$
To derive the full conditional for , we examine the prior and likelihood components of the model. It is shown in Appendix A that the likelihood term may be written
[TABLE]
The first term is a multiplicative constant independent of $\mathbf{\Psi}$ and vanishes via normalization. Thus, the full conditional depends on $\mathbf{z}$ and $\mathbf{b}$ only through the sufficient statistic $\mathbf{l}$ defined by
[TABLE]
and so we may suppose without loss of generality that the likelihood term is categorical. Under these conditions, we prove the full conditional for admits a stick-breaking representation.
Proposition 1.
Without loss of generality, suppose
[TABLE]
Then is given by
[TABLE]
where are the empirical counts of .
Proof.
Appendix B. ∎
This expression is similar to the stick-breaking representation of a Dirichlet process —however, it has different weights and does not include random atoms drawn from as part of its definition—see Appendix B for more details. Putting these ideas together, we define an infinite-dimensional parallel Gibbs sampler.
Algorithm 1.
Repeat until convergence.
- Sample $\boldsymbol{\phi}_k$ in parallel over topics for $k = 1, \dots, \infty$.
- Sample $\mathbf{z}_d$ in parallel over documents for $d = 1, \dots, D$.
- Sample $\mathbf{b}_d$ according to equation (14) in parallel over documents for $d = 1, \dots, D$.
- Sample $\mathbf{\Psi}$ according to equations (19)–(20).
Algorithm 1 is completely parallel, but cannot be implemented as stated due to the infinite number of full conditionals for , as well as the infinite product used in sampling . We now bypass these issues by introducing an approximate finite-dimensional sampling scheme.
2.4 Finite-dimensional sampling of $\mathbf{\Psi}$ and $\mathbf{z}$
By way of assuming , an HDP assumes an infinite number of topics are present a priori, with the number of tokens per topic decreasing rapidly with the topic’s index in a manner controlled by . Thus, under the model, a topic with a sufficiently large index should contain no tokens with high probability.
We thus propose to approximate $\mathbf{\Psi}$ by projecting its tail onto a single flag topic $K^*$, which stands for all topics not explicitly represented as part of the computation. This can be done by deterministically setting the final stick-breaking weight to one in equation (19). The resulting finite-dimensional $\mathbf{\Psi}$ will be the correct posterior full conditional for the finite-dimensional generalized Dirichlet prior considered previously in Section 2.3. Hence, this finite-dimensional truncation forms a Bayesian model in its own right, which suggests it should perform reasonably well. From an asymptotic perspective, Ishwaran and James (2001) have shown that the approximation is almost surely convergent and, therefore, well-posed.
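The truncation can be sketched in a few lines. The Beta parameters used below are an assumption: they are the standard posterior of a stick-breaking (generalized Dirichlet) prior with sticks distributed as Beta(1, γ) under categorical counts, and the exact weights of equations (19)–(20) are not reproduced in this text. The function name `sample_psi_truncated` and the argument `l` (per-topic counts of the sufficient statistic) are likewise illustrative.

```python
import numpy as np

def sample_psi_truncated(l, gamma, rng):
    """Sample a truncated stick-breaking vector given per-topic counts l.

    Assumed posterior: stick k ~ Beta(1 + l[k], gamma + sum of l[j] for j > k),
    with the last stick fixed to one so the vector sums to one.
    """
    l = np.asarray(l, dtype=float)
    # tail[k] = sum of counts for topics with index greater than k
    suffix = np.cumsum(l[::-1])[::-1]
    tail = np.concatenate((suffix[1:], [0.0]))
    sticks = rng.beta(1.0 + l, gamma + tail)
    # Truncation: the last (flag) topic absorbs all remaining mass.
    sticks[-1] = 1.0
    # remaining[k] = product of (1 - sticks[j]) over j < k
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - sticks[:-1])))
    return sticks * remaining

rng = np.random.default_rng(0)
psi = sample_psi_truncated([5, 0, 3, 0], gamma=1.0, rng=rng)
```

Because the last stick is set to one, the returned vector always sums to one, and the mass placed on the final entry indicates how much probability the truncation assigns to topics that are not explicitly represented.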
Once this is done, becomes a finite vector of length , and only rows of need to be explicitly instantiated as part of the computation. This instantiation allows the algorithm to be defined on a fixed finite state space, simplifying bookkeeping and implementation.
From a computational efficiency perspective, the resulting value takes the place of in partially collapsed LDA. However, it cannot be interpreted as the number of topics in the sense of LDA. Indeed, LDA implicitly assumes that deterministically—i.e., that every topic is assumed a priori to contain the same number of tokens. In contrast, the HDP model learns this distribution from the data by letting .
If we allow the state space to be resized when topic is sampled, then following Papaspiliopoulos and Roberts (2008), it is possible to develop truncation schemes which introduce no error. Since this results in more complicated bookkeeping which reduces performance, we instead fix and defer such considerations to future work. We recommend setting to be sufficiently large that it does not significantly affect the model’s behavior, which can be checked by tracking the number of tokens assigned to the topic .
2.5 Sparse sampling of $\mathbf{\Phi}$ and $\mathbf{z}$
To be efficient, a topic model needs to utilize the sparsity found in natural language as much as possible. In our case, the two main sources of sparsity are as follows.
1. Document-topic sparsity: most documents will only contain a handful of topics.
2. Topic-word sparsity: most word types will not be present in most topics.
We thus expect the document-topic sufficient statistic $\mathbf{m}$ and topic-word sufficient statistic $\mathbf{n}$ to contain many zeros. We seek to use this to reduce sampling complexity. Our starting point is the Poisson Pólya Urn sampler of Terenin et al. (2019), which presents a Gibbs sampler for LDA with computational complexity that depends on the minimum of two sparsity coefficients representing document-topic and topic-word sparsity; such algorithms are termed doubly sparse. The key idea is to approximate the Dirichlet full conditional for $\boldsymbol{\phi}_k$ with a Poisson Pólya Urn (PPU) distribution defined by
[TABLE]
for . This distribution is discrete, so becomes a sparse matrix. The approximation is accurate even for small values of , and Terenin et al. (2019) proves that the approximation error will vanish for large data sets in the sense of convergence in distribution.
If is uniform, we can further use sparsity to accelerate sampling . Since a sum of Poisson random variables is Poisson, we can split . We then sample sparsely by introducing a Poisson process and sampling its points uniformly, and sample sparsely by iterating over nonzero entries of .
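A minimal sketch of this sparse sampling step for a single topic follows. It assumes the PPU draw is obtained by normalizing independent Poisson counts with rates equal to the prior concentration plus the topic-word count, and that the prior concentration is a single scalar `beta` shared by all word types. The function name, the dictionary-based sparse row, and the handling of an all-zero draw are illustrative choices rather than details taken from the paper.

```python
import numpy as np

def sample_phi_row_ppu(n_row, beta, V, rng):
    """Sparse PPU draw for one topic.

    n_row: dict {word type: count} holding the nonzero entries of one row
    of the topic-word sufficient statistic. Returns a sparse dict of
    normalized probabilities.
    """
    counts = {}
    # Prior part: the sum of V independent Poisson(beta) draws is
    # Poisson(V * beta); given the total, points fall uniformly over words.
    for v in rng.integers(V, size=rng.poisson(V * beta)):
        counts[int(v)] = counts.get(int(v), 0) + 1
    # Data part: only nonzero sufficient statistics can contribute.
    for v, n in n_row.items():
        draw = int(rng.poisson(n))
        if draw > 0:
            counts[v] = counts.get(v, 0) + draw
    total = sum(counts.values())
    if total == 0:
        return {}  # degenerate draw; a real implementation must handle this
    return {v: c / total for v, c in counts.items()}

rng = np.random.default_rng(0)
phi_k = sample_phi_row_ppu({3: 40, 7: 10}, beta=0.01, V=1000, rng=rng)
```

The cost of the prior part is proportional to the number of Poisson points rather than to the vocabulary size, and the cost of the data part is proportional to the number of nonzero counts, so the resulting row stays sparse.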
For , the full conditional
[TABLE]
is similar to the one in partially collapsed LDA (Magnusson et al., 2018); the difference is the presence of $\Psi_k$. As $\Psi_k$ only enters the expression through component $(a)$ and is identical for all $z_{i,d}$, it can be absorbed at each iteration directly into an alias table (Walker, 1977; Li et al., 2014). Component $(b)$ can be computed efficiently by utilizing sparsity of $\mathbf{\Phi}$ and $\mathbf{m}$ and iterating over whichever has fewer non-zero entries.
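The two-component split can be sketched as follows. This assumes the full conditional for a topic indicator is proportional to the word's probability under topic $k$ times the sum of a prior term (the document-topic concentration times the global weight of topic $k$) and the document's count for topic $k$, which is the partially collapsed LDA form with the global topic weight inserted. The names `sample_topic`, `phi_v`, `m_d`, and `alpha_psi` are hypothetical, and a plain dictionary stands in for the alias table.

```python
import random

def sample_topic(phi_v, m_d, alpha_psi, rng):
    """Draw one topic indicator from a two-component mixture.

    phi_v: {topic: probability of this word type under the topic} (sparse)
    m_d: {topic: token count in this document} (sparse, current token removed)
    alpha_psi: list of prior weights, one per topic
    """
    # Prior component: shared by all tokens of this word type. A real
    # implementation would cache it in an alias table per word type.
    a = {k: p * alpha_psi[k] for k, p in phi_v.items()}
    # Count component: nonzero only where both factors are nonzero, so
    # iterate over whichever sparse vector has fewer entries.
    small, large = (phi_v, m_d) if len(phi_v) <= len(m_d) else (m_d, phi_v)
    b = {k: phi_v[k] * m_d[k] for k in small if k in large}
    mass_a, mass_b = sum(a.values()), sum(b.values())
    # Choose a component in proportion to its mass, then a topic within it.
    u = rng.random() * (mass_a + mass_b)
    if u < mass_b:
        part = b
    else:
        part, u = a, u - mass_b
    for k, weight in part.items():
        u -= weight
        if u <= 0:
            break
    return k

rng = random.Random(0)
topic = sample_topic({0: 0.2, 2: 0.8}, {2: 5, 4: 1}, [0.1, 0.1, 0.1, 0.1, 0.1], rng)
```

Building the count component touches only the shorter of the two sparse vectors, which is where the dependence on the minimum of the two sparsity terms comes from in this sketch.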
2.6 Direct sampling of $\mathbf{l}$
Rather than sampling , whose size will grow linearly with the number of documents, we introduce a scheme for sampling the sufficient statistic directly. Observe that
[TABLE]
where the domain of summation and the value of the indicators have been switched. By definition of , we have
[TABLE]
where
[TABLE]
Summing this expression over documents, we obtain the expression
[TABLE]
where is the total number of documents with . Since for all topics without any tokens assigned, we only need to sample for topics that have tokens assigned to them. This idea can also be straightforwardly applied to other HDP samplers (Chang and Fisher, 2014; Ge et al., 2015), by allowing one to derive alternative full conditionals in lieu of the Stirling distribution (Antoniak, 1974). The complexity of sampling directly is constant with respect to the number of documents, and depends instead on the maximum number of tokens per document.
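To handle the bookkeeping necessary for this computation, we introduce a sparse matrix, with one row per topic and one column per possible token count, whose entries are the number of documents that have exactly that many topic indicators assigned to the given topic. We increment this matrix once the topic indicators of a document have been sampled, by iterating over the non-zero elements of that document's row of the document-topic sufficient statistic. We then compute the number of documents with at least a given number of topic indicators for each topic as the reverse cumulative sum of the rows of this matrix.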
2.7 Poisson Pólya urn partially collapsed Gibbs sampling
Putting all of these ideas together, we obtain the following algorithm.
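Algorithm 2.
Repeat until convergence.
- Sample $\boldsymbol{\phi}_k$ in parallel over topics for $k = 1, \dots, K^*$.
- Sample $\mathbf{z}_d$ in parallel over documents for $d = 1, \dots, D$.
- Sample $l_k$ according to equation (28) in parallel over topics for $k = 1, \dots, K^*$.
- Sample $\mathbf{\Psi}$ according to equations (19)–(20), except with the final stick-breaking weight set to one.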
Algorithm 2 is sparse, massively parallel, defined on a fixed finite state space, and contains no infinite computations in any of its steps. The Gibbs step for converges in distribution (Terenin et al., 2019) to the true Gibbs steps as , and the Gibbs step for converges almost surely (Ishwaran and James, 2001) to the true Gibbs step as .
2.8 Computational complexity
We now examine the per-iteration computational complexity of Algorithm 2. To proceed, we fix and maximum document size , and relate the vocabulary size with the number of total words as follows.
Assumption (Heaps’ Law).
The number of unique words in a corpus follows Heaps’ law (Heaps, 1978) with constants and .
The per-iteration complexity of Algorithm 2 is equal to the sum of the per-iteration complexity of sampling its components. The sampling complexities of and are constant with respect to the number of tokens, and the sampling complexity of has been shown by Magnusson et al. (2018) to be negligible under the given assumptions. Thus, it suffices to consider .
At a given iteration, let $K^{(\mathbf{m})}_{d(i)}$ be the number of existing topics in document $d$ associated with word token $i$, and let $K^{(\mathbf{\Phi})}_{v(i)}$ be the number of nonzero topics in the row of $\mathbf{\Phi}$ corresponding to word token $i$. It follows immediately from the argument given by Terenin et al. (2019) that the per-iteration complexity of sampling each topic indicator is
$$\mathcal{O}\big[\min\big(K^{(\mathbf{m})}_{d(i)}, K^{(\mathbf{\Phi})}_{v(i)}\big)\big].$$
Algorithm 2 is thus a doubly sparse algorithm.
3 Performance results
To study performance of the partially collapsed sampler (Algorithm 2), we implemented it in Java using the open-source Mallet (McCallum, 2002) topic modeling framework. We ran it on the AP, CGCBIB, NeurIPS, and PubMed corpora, which are summarized in Table 2. Prior hyperparameters controlling the degree of sparsity were set to . We set and observed no tokens ever allocated to the topic . Data were preprocessed with default Mallet (McCallum, 2002) stop-word removal, minimum document size of 10, and a rare word limit of 10. Following Teh et al. (2006), the algorithm was initialized with one topic. All experiments were repeated five times to assess variability. Total runtime for each experiment is given in Table 2.
To assess Algorithm 2 in a small-scale setting, we compare it to the widely-studied sparse fully collapsed direct assignment sampler of Teh et al. (2006), which is not parallel. We ran 100 000 iterations of both methods on AP and CGCBIB. We selected these corpora because they were among the larger corpora on which it was feasible to run our direct assignment reference implementation within one week.
Trace plots for the log marginal likelihood for given and the number of active topics, i.e., those topics assigned at least one token, can be seen in Figure 1(a,d) and Figure 1(b,e), respectively. The direct assignment algorithm converges slower, but achieves a slightly better local optimum in terms of marginal log-likelihood, compared to our method. This fact indicates that the direct assignment method may stabilize around a different local optimum, and may represent a potential limitation of the partially collapsed sampler in settings where non-parallel methods are practical.
To better understand the distributional differences between the algorithms, we examined the number of tokens per topic, which can be seen in Figure 1(c,f). The partially collapsed sampler is seen to assign more tokens to smaller topics, indicating that it stabilizes around a local optimum with slightly broader semantic themes.
To visualize the effect this has on the topics, we examined the most common words for each topic. Since the algorithms generate too many topics to make full examination practical, we instead compute a quantile summary with five topics per quantile. The quantile is computed by ranking all topics by the number of tokens, choosing the five closest topics to the , , , , and quantiles in the ranking, and computing their top words. This approach gives a representative view of the algorithm’s output for large, medium, and small topics. Results may be seen in Appendix D and Appendix C—we find the direct assignment and partially collapsed samplers to be mostly comparable, with substantial overlap in top words for common topics.
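The quantile summary described above can be sketched as follows. The quantile levels used in the paper are not recoverable from this text, so the function takes them as a parameter and the values in the usage example are placeholders. The function name `quantile_summary` and the tie-breaking rule are illustrative assumptions.

```python
def quantile_summary(topic_sizes, quantiles, per_quantile=5):
    """Pick the topics whose size rank is closest to each quantile.

    topic_sizes: {topic id: number of tokens assigned to the topic}
    quantiles: iterable of fractions in [0, 1]
    Returns {quantile: list of topic ids, ordered from smaller to larger}.
    """
    # Rank topics from smallest to largest by token count.
    ranked = sorted(topic_sizes, key=topic_sizes.get)
    summary = {}
    for q in quantiles:
        center = q * (len(ranked) - 1)
        # Order rank positions by distance to the quantile position,
        # breaking ties in favor of the smaller rank.
        nearest = sorted(range(len(ranked)), key=lambda r: (abs(r - center), r))
        summary[q] = [ranked[r] for r in sorted(nearest[:per_quantile])]
    return summary

# Placeholder sizes and quantile levels, for illustration only.
sizes = {t: 10 * t for t in range(11)}
summary = quantile_summary(sizes, [0.5, 1.0], per_quantile=3)
```

The top words of each selected topic would then be computed from the corresponding row of the topic-word statistic.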
Next, we assess Algorithm 2 in a more demanding setting and compare against previous parallel state-of-the-art. There are various scalable samplers available for the HDP. For a fair comparison, we restrict ourselves to those samplers designed for topic models and explicitly incorporate sparsity of natural language in their construction. Among these, we selected the parallel subcluster split-merge algorithm of Chang and Fisher (2014) as our baseline because it was used in the largest-scale benchmark of the HDP topic model performed to date to our awareness, and shows comparable performance to other methods (Ge et al., 2015). The subcluster split-merge algorithm is designed to converge with fewer iterations, but is more costly to run per iteration. Thus, we used a fixed computational budget of 24 hours of wall-clock time for both algorithms. Computation was performed on a system with a 4-core 8-thread CPU and 8GB RAM.
Results can be seen in Figure 1(g)—note that the subcluster split-merge algorithm is parametrized using sub-topic indicators and sub-topic probabilities, so its numerical log-likelihood values are not directly comparable to ours and should be interpreted purely to assess convergence. Algorithm 2 stabilizes much faster with respect to both the number of active topics in Figure 1(g), and marginal log-likelihood in Figure 1(h). The subcluster split-merge algorithm adds new topics one-at-a-time, whereas our algorithm can create multiple new topics per iteration—we hypothesize this difference leads to faster convergence for Algorithm 2.
In Figure 1(i), we observe that the amount of computing time per iteration increases substantially for the subcluster split-merge method as it adds more topics. For Algorithm 2, this stays approximately constant for its entire runtime.
To evaluate the topics produced by the algorithms, we again examined the most common words for each topic via a quantile summary, given in Appendix E. We find the subcluster split-merge algorithm appears to generate topics with slightly more semantic overlap compared to Algorithm 2, but otherwise produces comparable output.
Finally, to assess scalability, we ran 25 000 iterations of Algorithm 2 on PubMed, which contains 768m tokens. To our knowledge, this dataset is an order of magnitude larger than any datasets used in previous MCMC-based approaches for the HDP. Computation was performed on a compute node with 2x10-core CPUs with 20 threads and 64GB of RAM. The marginal likelihood and number of active topics are given in Figure 1(j) and Figure 1(k).
To evaluate the topics discovered by the algorithm, we examined their most common words—these may be seen in full in Appendix F. We observe that the semantic themes present in the topics vary according to how many tokens they have: topics with more tokens appear to be broader, whereas topics with fewer tokens appear to be more specific. This behavior illustrates a key difference between the HDP and methods like LDA, which do not contain a learned global topic distribution in their formulation. We suspect the effect is particularly pronounced on PubMed compared to CGCBIB and NeurIPS due to its large number of tokens.
4 Discussion
In this work, we introduce the parallel partially collapsed Gibbs sampler (Algorithm 1) for the HDP topic model, which converges to the correct target distribution. We propose a doubly sparse approximate sampler (Algorithm 2), which allows the HDP to be implemented with per-token sampling complexity of $\mathcal{O}\big[\min\big(K^{(\mathbf{m})}_{d(i)}, K^{(\mathbf{\Phi})}_{v(i)}\big)\big]$, which is the same as that of Pólya Urn LDA (Terenin et al., 2019). Compared to other approaches for the HDP, it offers the following improvements.
1. The algorithm is fully parallel in all steps.
2. The topic indicators $\mathbf{z}$ utilize all available sources of sparsity to accelerate sampling.
3. All steps not involving $\mathbf{z}$ have constant complexity with respect to data size.
4. The proposed sparse approximate algorithm becomes exact as $N \to \infty$ and $K^* \to \infty$.
These improvements allow us to train the HDP on larger corpora. The data-parallel nature of our approach means that the amount of available parallelism increases with data size. This parallelism avoids load-balancing-related scalability limitations pointed out by Gal and Ghahramani (2014).
Nonparametric topic models are less straightforward to evaluate empirically than ordinary topic models. In particular, we found topic coherence scores (Mimno et al., 2011) to be strongly affected by the number of active topics, which causes preference for models with fewer topics and more semantic overlap per topic. We view the development of summary statistics that are agnostic to the number of active topics, and of those measuring other aspects of topic quality such as overlap, to be an important direction for future work. We are particularly interested in techniques that can be used to compare algorithms for sampling from the same model defined over fully disjoint state spaces, such as Algorithm 2 and the subcluster split-merge algorithm in Section 3.
Partially collapsed HDP can stabilize around a different local mode than fully collapsed HDP as proposed by Teh et al. (2006). There have been attempts to improve mixing in that sampler (Chang and Fisher, 2014), including the use of Metropolis-Hastings steps for jumping between modes (Jain and Neal, 2004). These techniques are largely complementary to ours and can be explored in combination with the ideas presented here.
The HDP posterior is a heavily multimodal target for which full posterior exploration is known to be difficult (Chang and Fisher, 2014; Gal and Ghahramani, 2014; Buntine and Mishra, 2014), and sampling schemes are generally used more in the spirit of optimization than traditional MCMC. These issues are mirrored in other approaches, such as variational inference. There, restrictive mean-field factorization assumptions are often required, which reduces the quality of discovered topics. We view MAP-based analogs of ideas presented here as a promising direction, since these may allow additional flexibility that may enable faster training.
Many of the ideas in this work, such as the binomial trick, are generic and apply to any topic model structurally similar to the HDP’s GEM representation (Teh et al., 2006) given in Section 2. For example, one could consider an informative prior for in lieu of , potentially improving convergence and topic quality, or developing parallel schemes for other nonparametric topic models such as Pitman-Yor models (Teh, 2006), tree-based models (Hu and Boyd-Graber, 2012; Paisley et al., 2015), embedded topic models (Dieng et al., 2020), as well as nonparametric topic models used within data-efficient language models (Guo et al., 2020) in future work.
Conclusion
We introduce the doubly sparse partially collapsed Gibbs sampler for the hierarchical Dirichlet process topic model. By formulating this algorithm using a representation of the HDP which connects it with the well-studied Latent Dirichlet Allocation model, we obtain a parallel algorithm whose per-token sampling complexity is the minimum of two sparsity terms. The ideas used apply to a large array of topic models, for example, dynamic topic models with time-varying, which possess the same full conditional for . Our algorithm for the HDP scales to a 768m-token corpus (PubMed) on a single multicore machine in under four days.
The proposed techniques leverage parallelism and sparsity to scale nonparametric topic models to larger datasets than previously considered feasible for MCMC or other methods possessing similar convergence properties. We hope these contributions enable wider use of Bayesian nonparametrics for large collections of text.
Acknowledgments
The research was funded by the Academy of Finland (grants 298742, 313122), as well as the Swedish Research Council (grants 201805170, 201806063). Computations were performed using compute resources within the Aalto University School of Science and Department of Computing at Imperial College London. We also acknowledge the support of Ericsson AB.
Appendix A: sufficiency of $\mathbf{l}$ and full conditional for $\mathbf{b}$
Recall that the one-step-ahead conditional probability mass function in a Pólya sequence taking values in with concentration parameter and base probability mass function is
[TABLE]
Introducing the random variable
[TABLE]
we can express the one-step-ahead conditional distribution as
[TABLE]
The joint probability mass function for is then
[TABLE]
Note that and vice versa. Thus each term in the product for only has one component, and we may express as
[TABLE]
where we have re-expressed the probability mass function of in a form that emphasizes conjugacy. Thus for any prior, the posterior will only depend on the likelihood of the values of for which . The sufficient statistic is
[TABLE]
Next, for a given , we can calculate the posterior of a component as
[TABLE]
where we have divided both expressions by
[TABLE]
which is constant with respect to . Note that full conditionally, we have $b_{i} \perp\!\!\!\perp b_{i'}$ for $i \neq i'$. This gives the desired expressions and concludes the derivation.
Appendix B: full conditional for $\mathbf{\Psi}$
Before proceeding with the derivation, we first comment on Proposition 1 and differences between the GEM distribution and Dirichlet process, which otherwise appear superficially similar. The GEM distribution is defined as
[TABLE]
On the other hand, a Dirichlet process is defined as
[TABLE]
From a Bayesian perspective, this extra stage—the presence of —prevents one from applying standard results on conjugacy of Dirichlet processes. The joint distribution of a finite set of states does not admit a closed-form expression, so we seek to derive the posterior conditional in a different way.
Rather than proving conjugacy for directly, we look for a larger finite-dimensional distribution within which sits that has better conjugacy properties. The generalized Dirichlet distribution of Connor and Mosimann (1969) fulfills this criterion. The conjugacy relationship we seek follows from the general property that conditioning and marginalization commute. This will be shown to yield the posterior
[TABLE]
For comparison, a posterior Dirichlet process is given by
[TABLE]
which shows that this relatively mild difference in the prior yields a posterior of a rather different form.
We now proceed to formally calculate this posterior distribution, starting from a GEM prior and discrete likelihood. Since we are working in a nonparametric setting, we begin by introducing the necessary formalism. We then introduce our finite-dimensional approximating prior and compute the posterior under it. For this, we use commutativity of conditioning and marginalization to deduce the full infinite-dimensional posterior.
Definition 2 (Preliminaries).
Let be a probability space. Let be the space of signed measures, equipped with the topology of weak convergence. Let be the space of probability measures over , and identify with the probability simplex by the homeomorphism . Let , let , and let be its empirical counts, defined by where is equal to $1$ for coordinate and $0$ for all other coordinates. Let . Recall that and , endowed with the discrete topology and topology of weak convergence, respectively, are both Polish spaces; hence, the Disintegration Theorem (Ambrosio et al. (2005), Theorem 5.3.1; Bogachev (2007), Corollary 10.4.15) holds in both spaces. We associate each random variable with its pushforward probability measure , and each conditional random variable with its pushforward regular conditional probability measure , where the preimage is taken with respect to .
Definition 3 (Discrete likelihood).
For all , define the conditional random variable by its probability mass function
[TABLE]
We say .
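As a concrete illustration of the empirical counts and the discrete likelihood, the following minimal sketch computes both for a finite sample. It assumes states are labeled by the integers `0, ..., num_states - 1` and that the likelihood of a state is simply its entry in a probability vector; the function names are hypothetical and not taken from the paper.

```python
import numpy as np


def empirical_counts(x, num_states):
    """Count how often each state 0..num_states-1 occurs in the sample x."""
    return np.bincount(np.asarray(x, dtype=np.int64), minlength=num_states)


def discrete_log_likelihood(x, probs):
    """Log-likelihood of i.i.d. draws x under the probability vector probs."""
    probs = np.asarray(probs, dtype=float)
    counts = empirical_counts(x, len(probs))
    # Only observed states contribute, so 0 * log(0) never arises.
    observed = counts > 0
    return float(np.sum(counts[observed] * np.log(probs[observed])))
```

The log-likelihood depends on the sample only through its counts, which is the sufficiency property the later results rely on.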
Definition 4 (GEM).
Let be a random variable defined by
[TABLE]
We say .
Definition 5 (Finite GEM).
Let be a random variable defined by
[TABLE]
We say .
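A finite stick-breaking construction of this kind can be sketched as follows. The sketch assumes the usual form in which each stick is drawn from a Beta(1, concentration) distribution and the final stick is set to one so that the weights sum to one; the function name and argument names are hypothetical.

```python
import numpy as np


def sample_finite_gem(concentration, num_sticks, rng):
    """Draw a probability vector of length num_sticks by stick breaking.

    Assumption of this sketch: sticks are Beta(1, concentration) and the
    last stick is fixed to 1, which makes the weights sum to one.
    """
    sticks = rng.beta(1.0, concentration, size=num_sticks)
    sticks[-1] = 1.0
    # remaining[k] is the stick length left before breaking off piece k.
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - sticks[:-1])))
    return sticks * remaining
```

For example, `sample_finite_gem(2.0, 10, np.random.default_rng(0))` returns ten non-negative weights that sum to one.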
Definition 6 (Posterior).
Let be the unique conditional random variable given by the Disintegration Theorem, where uniqueness follows from almost sure uniqueness by virtue of the marginal measure being absolutely continuous with respect to the counting measure on , which has no non-empty null sets.
Result 7.
Let . Let , and let . Let . Then for any with empirical counts , we have that is a conditional random variable defined by
[TABLE]
where
[TABLE]
Proof.
It is shown by Connor and Mosimann (1969) that is a special case of the generalized Dirichlet distribution, which admits a general stick-breaking representation. Thus, its probability density function is
[TABLE]
which we have expressed in a simplified form. By conjugacy, for a given and associated the posterior probability density is
[TABLE]
which is again a generalized Dirichlet admitting the necessary stick-breaking representation, which we have expressed in a form that emphasizes its posterior hyperparameters. ∎
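The conjugate update in this proof can be sketched numerically. The sketch assumes Beta(1, concentration) prior sticks, for which the standard stick-breaking posterior for stick k is Beta(1 + n_k, concentration + sum of counts at later states); the function names are hypothetical, and the exact parameterization should be checked against the displays above.

```python
import numpy as np


def posterior_stick_parameters(counts, concentration):
    """Return Beta parameters (a, b) of each posterior stick.

    Assumes prior sticks are Beta(1, concentration), so that
    a[k] = 1 + counts[k] and b[k] = concentration + sum(counts[k+1:]).
    """
    counts = np.asarray(counts, dtype=float)
    # tail[k] is the total count at states strictly after k.
    tail = np.concatenate((np.cumsum(counts[::-1])[::-1][1:], [0.0]))
    return 1.0 + counts, concentration + tail


def sample_finite_gem_posterior(counts, concentration, rng):
    """Draw a posterior probability vector given empirical counts."""
    a, b = posterior_stick_parameters(counts, concentration)
    sticks = rng.beta(a, b)
    sticks[-1] = 1.0  # the final stick is fixed so the weights sum to one
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - sticks[:-1])))
    return sticks * remaining
```

Under these assumptions, padding `counts` with trailing zeros leaves the parameters of the earlier sticks unchanged, which is consistent with the marginalization argument used to pass from the finite to the infinite-dimensional posterior.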
Remark 8.
It is now clear that the assumption is indeed taken without loss of generality, because if we instead took to be given by a Pólya sequence, then by sufficiency the prior-to-posterior map would be identical.
Proposition 1.
Without loss of generality, suppose
[TABLE]
Then is given by
[TABLE]
where are the empirical counts of .
Proof.
Let be an arbitrary finite index set, and let be the finite-dimensional marginal projection of onto the coordinates contained in . Let , let be the posterior conditional random variable under , and let be the marginal consisting of those coordinates contained in . By construction, equals in distribution. Since by the Disintegration Theorem, conditioning and marginalization commute, the set is arbitrary, and is uniquely determined by its finite-dimensional marginal projections, the claim follows. ∎
Appendix C: quantile summary of topics for AP
Here we display a multi-quantile summary for AP, obtained by ranking all topics with at least 100 tokens by their total number of tokens, computing the , , , , and quantiles. We compute the five topics closest to each quantile by number of tokens, and display their top-eight words.
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
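The multi-quantile summary described in this appendix can be sketched as follows. This is a minimal illustration, not the code used to produce the tables: the function name, the dense topic-by-word count matrix, and the `quantiles` argument are assumptions, and the quantile levels must be supplied by the caller.

```python
import numpy as np


def quantile_topic_summary(topic_word_counts, vocab, quantiles,
                           min_tokens=100, topics_per_quantile=5, top_words=8):
    """Summarize topics near given quantiles of topic size.

    topic_word_counts: array of shape (num_topics, vocab_size) with token counts.
    Returns {quantile: [(topic index, total tokens, top words), ...]}.
    """
    counts = np.asarray(topic_word_counts)
    totals = counts.sum(axis=1)
    # Keep only topics with at least min_tokens assigned tokens.
    eligible = np.flatnonzero(totals >= min_tokens)
    if eligible.size == 0:
        return {}
    summary = {}
    for q in quantiles:
        target = np.quantile(totals[eligible], q)
        # Stable sort: ties in distance are broken by topic index.
        order = np.argsort(np.abs(totals[eligible] - target), kind="stable")
        chosen = eligible[order[:topics_per_quantile]]
        summary[q] = [
            (int(k), int(totals[k]),
             [vocab[v] for v in np.argsort(-counts[k], kind="stable")[:top_words]])
            for k in chosen
        ]
    return summary
```

A dense count matrix is used here only for brevity; for a large vocabulary a sparse representation would be the more natural choice.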
Appendix D: quantile summary of topics for CGCBIB
Here we display a multi-quantile summary for CGCBIB, obtained by ranking all topics with at least 100 tokens by their total number of tokens, computing the , , , , and quantiles. We compute the five topics closest to each quantile by number of tokens, and display their top-eight words.
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
Appendix E: quantile summary of topics for NeurIPS
Here we display a multi-quantile summary for NeurIPS, obtained by ranking all topics with at least 100 tokens by their total number of tokens, computing the , , , , and quantiles. We compute the five topics closest to each quantile by number of tokens, and display their top-eight words.
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
Appendix F: topics produced by Algorithm 2 on PubMed
Here we show the top eight words for each topic, together with the total number of tokens assigned to it, which is shown at the top of each table. We display all topics containing at least eight unique word tokens.
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
[TABLE]
References
- Amr Ahmed and Eric P. Xing. 2010. Timeline: a dynamic hierarchical Dirichlet process model for recovering birth/death and evolution of topics in text stream. In Uncertainty in Artificial Intelligence, pages 20–29.
- Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. 2005. Gradient Flows in Metric Spaces and in the Space of Probability Measures. Birkhäuser.
- Charles E. Antoniak. 1974. Mixtures of Dirichlet processes with applications to Bayesian nonparametric problems. The Annals of Statistics, 2(6):1152–1174.
- David Blackwell and James B. MacQueen. 1973. Ferguson distributions via Pólya urn schemes. The Annals of Statistics, 1(2):353–355.
- David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research, 3(1):993–1022.
- Vladimir I. Bogachev. 2007. Measure Theory: Volume II. Springer.
- Wray L. Buntine and Swapnil Mishra. 2014. Experiments with non-parametric topic models. In Knowledge Discovery and Data Mining, pages 881–890.
- Jason Chang and John W. Fisher, III. 2014. Parallel sampling of HDPs using sub-cluster splits. In Advances in Neural Information Processing Systems, pages 235–243.
