# Sparse Parallel Training of Hierarchical Dirichlet Process Topic Models

**Authors:** Alexander Terenin, Måns Magnusson, Leif Jonsson

arXiv: 1906.02416 · 2020-10-07

## TL;DR

This paper introduces a doubly sparse data-parallel sampler for hierarchical Dirichlet process (HDP) topic models, making it practical to train on large text corpora with a single multi-core machine.

## Contribution

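It proposes a doubly sparse data-parallel sampler for the HDP topic model, built on a representation of certain conditional distributions within the HDP, that exploits all available sources of sparsity in natural language to keep computation efficient.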

## Key findings

- Trained on the PubMed corpus (8 million documents, 768 million tokens) in under four days
- Ran on a single multi-core machine using data-parallel sampling
- Kept computation efficient by exploiting the sparsity found in natural language

## Abstract

To scale non-parametric extensions of probabilistic topic models such as Latent Dirichlet allocation to larger data sets, practitioners rely increasingly on parallel and distributed systems. In this work, we study data-parallel training for the hierarchical Dirichlet process (HDP) topic model. Based upon a representation of certain conditional distributions within an HDP, we propose a doubly sparse data-parallel sampler for the HDP topic model. This sampler utilizes all available sources of sparsity found in natural language - an important way to make computation efficient. We benchmark our method on a well-known corpus (PubMed) with 8m documents and 768m tokens, using a single multi-core machine in under four days.
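To make the idea of exploiting two sources of sparsity concrete, here is a minimal Python sketch of a sparse topic draw for a single token. It is an illustration under stated assumptions, not the paper's algorithm: it assumes the token's topic weight factors as `phi[k] * (prior[k] + doc_counts[k])`, where both the word's topic probabilities and the document's topic counts are mostly zero. The names `split_weights`, `sample_topic`, `phi_col`, `doc_counts`, and `prior` are hypothetical and do not come from the paper.

```python
import random


def split_weights(phi_col, doc_counts, prior):
    """Split unnormalized topic weights into a prior part and a count part.

    phi_col:    dict topic -> probability of the current word (sparse, assumed)
    doc_counts: dict topic -> tokens of this document in the topic (sparse, assumed)
    prior:      dict topic -> prior mass for the topic (assumed dense)
    """
    # Prior part: only topics that give the word nonzero probability contribute.
    prior_part = {k: p * prior[k] for k, p in phi_col.items() if k in prior}
    # Count part: only topics nonzero in BOTH sparse structures contribute,
    # so iterate over the smaller one and probe the larger one.
    small, large = sorted((phi_col, doc_counts), key=len)
    count_part = {k: phi_col[k] * doc_counts[k] for k in small if k in large}
    return prior_part, count_part


def sample_topic(phi_col, doc_counts, prior, rng):
    """Draw one topic with probability proportional to phi * (prior + count)."""
    prior_part, count_part = split_weights(phi_col, doc_counts, prior)
    total_count = sum(count_part.values())
    total = total_count + sum(prior_part.values())
    if total <= 0.0:
        raise ValueError("word has zero probability under every topic")
    u = rng.random() * total
    # Pick a bucket first, then walk only the nonzero entries of that bucket.
    if u < total_count:
        bucket = count_part
    else:
        bucket = prior_part
        u -= total_count
    topic = None
    for topic, weight in bucket.items():
        u -= weight
        if u <= 0.0:
            break
    return topic  # the last topic is the fallback for floating-point round-off


rng = random.Random(0)
phi_col = {0: 0.5, 2: 0.25}          # word is supported by topics 0 and 2 only
doc_counts = {2: 4, 5: 1}            # document uses topics 2 and 5 only
prior = {0: 0.1, 2: 0.1, 5: 0.1}
draws = [sample_topic(phi_col, doc_counts, prior, rng) for _ in range(1000)]
```

In this toy setup the count bucket contains only topic 2, the single topic shared by the word and the document, so the work per token depends on the number of nonzero entries rather than on the total number of topics.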

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1906.02416/full.md

## Figures

8 figures with captions in the complete paper: https://tomesphere.com/paper/1906.02416/full.md

## References

37 references — full list in the complete paper: https://tomesphere.com/paper/1906.02416/full.md

---
Source: https://tomesphere.com/paper/1906.02416