textTOvec: Deep Contextualized Neural Autoregressive Topic Models of Language with Distributed Compositional Prior

Pankaj Gupta; Yatin Chaudhary; Florian Buettner; Hinrich Schütze

arXiv:1810.03947·cs.CL·February 26, 2019

TL;DR

This paper introduces ctx-DocNADEe, a neural autoregressive topic model that combines language modeling and external knowledge to improve topic estimation in short texts and small datasets.

Contribution

It unifies topic modeling and language modeling and adds word-embedding priors, addressing two weaknesses of probabilistic topic models: the loss of language structure and data sparsity.

Findings

1. Outperforms state-of-the-art models in perplexity, coherence, retrieval, and classification.

2. Effective on both long and short text datasets from diverse domains.

3. Enhances topic modeling in small and short-text corpora.

Abstract

We address two challenges of probabilistic topic modelling in order to better estimate the probability of a word in a given context, i.e., P(word|context): (1) No Language Structure in Context: Probabilistic topic models ignore word order by summarizing a given context as a "bag-of-words", and consequently the semantics of the words in the context are lost. The LSTM-LM learns a vector-space representation of each word by accounting for word order in local collocation patterns and models complex characteristics of language (e.g., syntax and semantics), while the TM simultaneously learns a latent representation from the entire document and discovers the underlying thematic structure. We unite these two complementary paradigms of learning the meaning of word occurrences by combining a TM (e.g., DocNADE) and an LM in a unified probabilistic framework, named ctx-DocNADE. (2) Limited Context and/or…
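The idea in point (1), mixing an order-invariant topic-side summary of the preceding words with an order-aware sequence-side summary, can be illustrated with a small sketch. This is not the authors' implementation. The abstract as shown here is truncated and does not give the exact formulation, so the additive combination, the mixing weight `lam`, the function names, and the use of a plain tanh RNN in place of the LSTM-LM are all assumptions made for illustration.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def log_softmax(scores):
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def hidden_states(doc, W, c, R, E, lam):
    """Return one hidden vector per position, each built from the prefix only.

    W[v], E[v] : length-H vectors for word id v (topic side, sequence side)
    c          : length-H hidden bias
    R          : H x H recurrent weights of the stand-in RNN (assumption: not an LSTM)
    lam        : hypothetical mixing weight for the sequence-side vector
    """
    H = len(c)
    bow = [0.0] * H   # order-invariant sum over words seen so far
    seq = [0.0] * H   # order-aware recurrent summary of the same prefix
    states = []
    for v in doc:
        # Hidden state for predicting word v uses only the words before it.
        states.append([sigmoid(c[j] + bow[j] + lam * seq[j]) for j in range(H)])
        bow = [bow[j] + W[v][j] for j in range(H)]
        seq = [math.tanh(sum(R[j][k] * seq[k] for k in range(H)) + E[v][j])
               for j in range(H)]
    return states

def doc_log_likelihood(doc, W, U, b, c, R, E, lam):
    """Sum of log p(v_i | v_<i) over the document (autoregressive factorisation)."""
    total = 0.0
    for h, v in zip(hidden_states(doc, W, c, R, E, lam), doc):
        scores = [b[w] + sum(U[w][j] * h[j] for j in range(len(h)))
                  for w in range(len(b))]
        total += log_softmax(scores)[v]
    return total

# Toy parameters (random, untrained), only to exercise the functions.
rng = random.Random(0)
V, H = 6, 4  # vocabulary size, hidden size

def mat(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]

W, E, U, R = mat(V, H), mat(V, H), mat(V, H), mat(H, H)
b, c = [0.0] * V, [0.0] * H
```

With `lam = 0.0` the hidden state used to predict the third word is identical for the prefixes `[0, 1]` and `[1, 0]`, which is the bag-of-words limitation the abstract describes. With a non-zero `lam` the recurrent term makes the two prefixes produce different hidden states, so word order influences the prediction.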

Code & Models

Repositories

pgcool/textTOvec (TensorFlow, official)

Taxonomy

Methods: Interpretability