# Analyses of Multi-collection Corpora via Compound Topic Modeling

**Authors:** Clint P. George, Wei Xia, George Michailidis

arXiv: 1907.01636 · 2019-07-04

## TL;DR

This paper introduces the compound latent Dirichlet allocation (cLDA) model for analyzing a corpus partitioned into subcollections. The subcollections share a common set of topics, while topic proportions vary among them. The authors study MCMC and variational inference, suggest an efficient MCMC method, and evaluate cLDA on synthetic and real-world corpora.

## Contribution

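The paper proposes cLDA, which extends single-collection topic models to a corpus partitioned into subcollections. It incorporates prior knowledge about the corpus (e.g. organization structure), depends less on user-input parameters than previous work, and comes with an efficient MCMC inference method.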

## Key findings

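- cLDA captures topics shared by all subcollections together with the variation in topic proportions among them
- Qualitative and quantitative evaluations on synthetic and real-world corpora favor cLDA, according to the authors
- Of the MCMC and variational inference approaches studied, the authors suggest an efficient MCMC method for estimating the parameters of interest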

## Abstract

As electronically stored data grow in daily life, obtaining novel and relevant information becomes challenging in text mining. Thus people have sought statistical methods based on term frequency, matrix algebra, or topic modeling for text mining. Popular topic models have centered on one single text collection, which is deficient for comparative text analyses. We consider a setting where one can partition the corpus into subcollections. Each subcollection shares a common set of topics, but there exists relative variation in topic proportions among collections. Including any prior knowledge about the corpus (e.g. organization structure), we propose the compound latent Dirichlet allocation (cLDA) model, improving on previous work, encouraging generalizability, and depending less on user-input parameters. To identify the parameters of interest in cLDA, we study Markov chain Monte Carlo (MCMC) and variational inference approaches extensively, and suggest an efficient MCMC method. We evaluate cLDA qualitatively and quantitatively using both synthetic and real-world corpora. The usability study on some real-world corpora illustrates the superiority of cLDA, which not only explores the underlying topics automatically but also models their connections and variations across multiple collections.
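
The setting described in the abstract, with topics shared by all subcollections and topic proportions that vary among them, can be illustrated with a small generative sketch. This is not the authors' specification, which is in the full paper. It assumes a two-level Dirichlet hierarchy: each collection draws a topic mixture, and each document draws its own topic proportions concentrated around its collection's mixture. The function name `sample_compound_corpus` and the hyperparameters `alpha`, `gamma`, and `eta` are hypothetical names chosen for this illustration.

```python
import numpy as np

def sample_compound_corpus(n_collections=2, docs_per_collection=3, doc_length=50,
                           n_topics=4, vocab_size=20,
                           alpha=1.0, gamma=50.0, eta=0.5, seed=0):
    """Sample a toy multi-collection corpus (illustrative assumption, not the paper's code)."""
    rng = np.random.default_rng(seed)

    # Topics are shared by every collection: each row is a distribution over the vocabulary.
    topics = rng.dirichlet(np.full(vocab_size, eta), size=n_topics)

    collection_mixtures = []
    corpus = []
    for _ in range(n_collections):
        # Collection-level topic mixture (assumed symmetric Dirichlet prior).
        pi = rng.dirichlet(np.full(n_topics, alpha))
        collection_mixtures.append(pi)

        docs = []
        for _ in range(docs_per_collection):
            # Document-level proportions concentrate around the collection mixture;
            # larger gamma means less variation within a collection.
            # The small constant keeps every Dirichlet parameter strictly positive.
            theta = rng.dirichlet(gamma * pi + 1e-8)
            # Draw a topic for each word position, then a word from that topic.
            z = rng.choice(n_topics, size=doc_length, p=theta)
            words = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
            docs.append(words)
        corpus.append(docs)

    return topics, np.array(collection_mixtures), corpus
```

Calling `sample_compound_corpus()` returns the shared topic matrix, one topic mixture per collection, and the sampled documents as arrays of word indices. Comparing the rows of the mixture array shows how two collections can draw on the same topics in different proportions.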

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1907.01636/full.md

## Figures

64 figures with captions in the complete paper: https://tomesphere.com/paper/1907.01636/full.md

## References

39 references — full list in the complete paper: https://tomesphere.com/paper/1907.01636/full.md

---
Source: https://tomesphere.com/paper/1907.01636