Multilingual Topic Models for Unaligned Text
Jordan Boyd-Graber, David Blei

TL;DR
This paper introduces MuTo, a probabilistic multilingual topic model that jointly discovers shared topics and a matching between two languages, and pairs related documents across them, without requiring aligned or parallel corpora.
Contribution
The paper presents MuTo, a multilingual topic model that uses stochastic EM to simultaneously infer a matching between two languages and shared latent topics from unaligned text.
Findings
MuTo effectively finds shared topics in real-world multilingual corpora.
MuTo successfully pairs related documents across languages.
The model operates without the need for parallel corpora.
Abstract
We develop the multilingual topic model for unaligned text (MuTo), a probabilistic model of text that is designed to analyze corpora composed of documents in two languages. From these documents, MuTo uses stochastic EM to simultaneously discover both a matching between the languages and multilingual latent topics. We demonstrate that MuTo is able to find shared topics on real-world multilingual corpora, successfully pairing related documents across languages. MuTo provides a new framework for creating multilingual topic models without needing carefully curated parallel corpora and allows applications built using the topic model formalism to be applied to a much wider class of corpora.
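To make concrete why a matching between languages helps pair related documents, here is a minimal sketch. It is not MuTo's stochastic EM inference: the matching is given by hand rather than learned, and the names `MATCHING`, `build_index`, `shared_counts`, `cosine`, and `pair_documents` are hypothetical, as is the choice to drop unmatched words and compare documents by cosine similarity instead of by topic proportions.

```python
from collections import Counter
from math import sqrt

# Hypothetical hand-written English/German matching (illustrative only;
# MuTo infers its matching from data instead of taking it as input).
MATCHING = [("computer", "rechner"), ("music", "musik"), ("river", "fluss")]


def build_index(matching):
    """Give each matched word pair one shared id, looked up per language."""
    en = {e: i for i, (e, _) in enumerate(matching)}
    de = {g: i for i, (_, g) in enumerate(matching)}
    return {"en": en, "de": de}


def shared_counts(tokens, lang, index):
    """Count shared ids in a document; unmatched words are dropped (assumption)."""
    table = index[lang]
    return Counter(table[t] for t in tokens if t in table)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pair_documents(en_docs, de_docs, matching):
    """For each English document, return the index of the closest German one."""
    index = build_index(matching)
    de_vecs = [shared_counts(d, "de", index) for d in de_docs]
    pairs = []
    for i, doc in enumerate(en_docs):
        vec = shared_counts(doc, "en", index)
        scores = [cosine(vec, w) for w in de_vecs]
        best = max(range(len(scores)), key=scores.__getitem__)
        pairs.append((i, best))
    return pairs
```

Once both languages are mapped onto the same set of ids, documents become directly comparable even though no document-level alignment was supplied. A topic model over those shared ids would compare documents by their inferred topic proportions rather than by raw counts.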
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
