ZenLDA: An Efficient and Scalable Topic Model Training System on Distributed Data-Parallel Platform

Bo Zhao; Hucheng Zhou; Guoqiang Li; Yihua Huang

arXiv:1511.00440·cs.DC·November 23, 2015·1 cites

TL;DR

zenLDA is a scalable, efficient system for large-scale LDA training on distributed platforms, combining novel algorithms and system optimizations to handle billions of documents and trillions of parameters.

Contribution

It introduces a new CGS algorithm and graph-based parallelization approach, enhancing scalability and accuracy for large-scale topic modeling.

Findings

01. Achieves significantly better performance than existing CGS algorithms.

02. Maintains higher model accuracy with approximations like sparse initialization.

03. Effectively handles web-scale corpora with billions of documents.

Abstract

This paper presents zenLDA, an efficient and scalable Collapsed Gibbs Sampling (CGS) system for Latent Dirichlet Allocation (LDA) training. Training at this scale is challenging because it requires both data parallelism and model parallelism: the sampling data can reach billions of documents, and the model can reach trillions of parameters. zenLDA combines algorithm-level improvements with system-level optimizations. It first presents a novel CGS algorithm that balances time complexity, model accuracy, and parallelization flexibility. The input corpus is represented as a directed graph, and model parameters are annotated as the corresponding vertex attributes. Distributed training is parallelized by partitioning the graph: in each iteration, zenLDA first applies the CGS step to all partitions in parallel, followed by synchronizing the computed…
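To make the graph view in the abstract concrete, here is a minimal single-machine sketch of textbook collapsed Gibbs sampling for LDA, not zenLDA's novel algorithm. It assumes each token is an edge `(doc_id, word_id)`, keeps topic-count vectors as attributes of the document and word vertices, and stores the current topic of each token on its edge. The function name `train_lda_cgs`, its parameters, and the default hyperparameters are hypothetical choices for illustration; graph partitioning and cross-partition synchronization are deliberately left out.

```python
import random


def train_lda_cgs(edges, num_docs, vocab_size, num_topics,
                  alpha=0.1, beta=0.01, iterations=20, seed=0):
    """Textbook collapsed Gibbs sampling over (doc_id, word_id) token edges.

    Illustrative sketch only: sequential, single machine, standard CGS.
    """
    rng = random.Random(seed)

    # Vertex attributes: one topic-count vector per document and per word.
    doc_topic = [[0] * num_topics for _ in range(num_docs)]
    word_topic = [[0] * num_topics for _ in range(vocab_size)]
    topic_total = [0] * num_topics

    # Edge attribute: the topic currently assigned to each token.
    assignment = []
    for d, w in edges:
        k = rng.randrange(num_topics)
        assignment.append(k)
        doc_topic[d][k] += 1
        word_topic[w][k] += 1
        topic_total[k] += 1

    topics = range(num_topics)
    for _ in range(iterations):
        for i, (d, w) in enumerate(edges):
            # Remove the token's current assignment from all counts.
            old = assignment[i]
            doc_topic[d][old] -= 1
            word_topic[w][old] -= 1
            topic_total[old] -= 1

            # Unnormalized conditional probability of each topic.
            weights = [
                (doc_topic[d][k] + alpha)
                * (word_topic[w][k] + beta)
                / (topic_total[k] + vocab_size * beta)
                for k in topics
            ]
            new = rng.choices(topics, weights=weights)[0]

            # Record the newly sampled topic.
            assignment[i] = new
            doc_topic[d][new] += 1
            word_topic[w][new] += 1
            topic_total[new] += 1

    return doc_topic, word_topic, assignment
```

In a distributed setting the inner loop would run per graph partition and the count vectors would be synchronized between iterations; how zenLDA does this is described in the paper itself and is not reproduced here.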

Taxonomy

Topics: Graph Theory and Algorithms · Scientific Computing and Data Management · Data Management and Algorithms