Computing Web-scale Topic Models using an Asynchronous Parameter Server

Rolf Jagerman; Carsten Eickhoff; Maarten de Rijke

arXiv:1605.07422 · cs.DC · June 20, 2017


TL;DR

This paper introduces APS-LDA, a scalable, asynchronous topic modeling framework integrated with Spark, capable of processing Web-scale datasets efficiently without sacrificing model quality.

Contribution

The paper presents a novel asynchronous parameter server for Spark-based topic modeling. It enables scalable, in-memory processing of massive datasets and plugs into existing data processing pipelines.

Findings

1. APS-LDA processes up to 135 times more data than existing Spark LDA implementations.

2. It handles 10 times more topics without loss of model quality.

3. The system eliminates disk writes by keeping data in memory from start to finish.

Abstract

Topic models such as Latent Dirichlet Allocation (LDA) have been widely used in information retrieval for tasks ranging from smoothing and feedback methods to tools for exploratory search and discovery. However, classical methods for inferring topic models do not scale up to the massive size of today's publicly available Web-scale data sets. The state-of-the-art approaches rely on custom strategies, implementations and hardware to facilitate their asynchronous, communication-intensive workloads. We present APS-LDA, which integrates state-of-the-art topic modeling with cluster computing frameworks such as Spark using a novel asynchronous parameter server. Advantages of this integration include convenient usage of existing data processing pipelines and eliminating the need for disk writes as data can be kept in memory from start to finish. Our goal is not to outperform highly customized…
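As a rough illustration of the asynchronous parameter-server idea the abstract describes, the Python sketch below keeps shared word-topic counts in a table that worker threads pull from and push deltas to, with no global barrier between workers. It is a toy, single-process stand-in under stated assumptions: `CountTable`, `run_worker`, and `train` are hypothetical names, the sampler is ordinary collapsed Gibbs sampling chosen for brevity (the paper's sampler may differ), and none of this is the APS-LDA or Glint API, which runs on Spark.

```python
import random
import threading
from concurrent.futures import ThreadPoolExecutor


class CountTable:
    """Toy in-process stand-in for a parameter server (hypothetical, not Glint's API)."""

    def __init__(self, num_rows, num_cols, num_shards=4):
        self.rows = [[0] * num_cols for _ in range(num_rows)]
        # One lock per shard; rows are assigned to shards by row % num_shards.
        self.locks = [threading.Lock() for _ in range(num_shards)]

    def pull(self, row):
        # Snapshot of one row; it may already be stale when the caller uses it.
        with self.locks[row % len(self.locks)]:
            return list(self.rows[row])

    def push(self, deltas):
        # deltas: iterable of (row, col, delta). Additions commute, so pushes
        # from different workers can be applied in any order.
        for row, col, delta in deltas:
            with self.locks[row % len(self.locks)]:
                self.rows[row][col] += delta


def run_worker(docs, word_topic, topic_totals, num_topics, vocab_size,
               alpha=0.1, beta=0.01, sweeps=2, seed=0):
    rng = random.Random(seed)
    # Random initialisation; assignments and doc-topic counts stay local.
    z = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    for doc, zd in zip(docs, z):
        word_topic.push((w, k, 1) for w, k in zip(doc, zd))
        topic_totals.push((0, k, 1) for k in zd)
    for _ in range(sweeps):
        for doc, zd in zip(docs, z):
            n_dk = [zd.count(k) for k in range(num_topics)]
            n_k = topic_totals.pull(0)
            buffer = []  # deltas accumulated locally, pushed once per document
            for i, w in enumerate(doc):
                k_old = zd[i]
                n_wk = word_topic.pull(w)
                weights = []
                for k in range(num_topics):
                    own = 1 if k == k_old else 0  # exclude the current token
                    # Clip at zero because server counts may be stale.
                    weights.append((max(n_wk[k] - own, 0) + beta)
                                   * (n_dk[k] - own + alpha)
                                   / (max(n_k[k] - own, 0) + vocab_size * beta))
                k_new = rng.choices(range(num_topics), weights=weights)[0]
                if k_new != k_old:
                    zd[i] = k_new
                    n_dk[k_old] -= 1
                    n_dk[k_new] += 1
                    buffer += [(w, k_old, -1), (w, k_new, 1)]
            word_topic.push(buffer)
            topic_totals.push((0, k, d) for _, k, d in buffer)
    return z


def train(partitions, num_topics, vocab_size):
    """Run one worker thread per document partition against shared tables."""
    word_topic = CountTable(vocab_size, num_topics)
    topic_totals = CountTable(1, num_topics, num_shards=1)
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        futures = [pool.submit(run_worker, docs, word_topic, topic_totals,
                               num_topics, vocab_size, seed=i)
                   for i, docs in enumerate(partitions)]
        assignments = [f.result() for f in futures]
    return word_topic, topic_totals, assignments
```

Calling `train([[[0, 1, 2, 1], [3, 3, 0]], [[4, 2, 2], [1, 0, 4, 4]]], num_topics=3, vocab_size=5)` runs two workers over word-id documents. Because every push is an exact additive delta, the final shared counts match the final topic assignments even though workers sample from stale snapshots along the way.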


Code & Models

Repositories

rjagerman/glint (Official)


Taxonomy

Methods: Latent Dirichlet Allocation