TL;DR
This paper introduces ctx-DocNADEe, a neural autoregressive topic model that combines language modeling and external knowledge to improve topic estimation in short texts and small datasets.
Contribution
It unifies topic modeling and language modeling with embeddings priors, addressing language structure loss and data sparsity in probabilistic topic models.
Findings
Outperforms state-of-the-art models in perplexity, coherence, retrieval, and classification.
Effective on both long and short text datasets from diverse domains.
Enhances topic modeling in small and short-text corpora.
Abstract
We address two challenges of probabilistic topic modelling in order to better estimate the probability of a word in a given context, i.e., P(word|context): (1) No Language Structure in Context: Probabilistic topic models ignore word order by summarizing a given context as a "bag-of-word" and consequently the semantics of words in the context is lost. The LSTM-LM learns a vector-space representation of each word by accounting for word order in local collocation patterns and models complex characteristics of language (e.g., syntax and semantics), while the TM simultaneously learns a latent representation from the entire document and discovers the underlying thematic structure. We unite two complementary paradigms of learning the meaning of word occurrences by combining a TM (e.g., DocNADE) and a LM in a unified probabilistic framework, named as ctx-DocNADE. (2) Limited Context and/or…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
MethodsInterpretability
