Multilingual Topic Models for Unaligned Text

Jordan Boyd-Graber; David Blei

arXiv:1205.2657 · cs.CL · May 14, 2012 · 45 cites

Open Access

TL;DR

This paper introduces MuTo, a probabilistic multilingual topic model that discovers shared topics and a matching between two languages, and pairs related documents across them, without requiring aligned or parallel corpora.

Contribution

The paper presents MuTo, a novel multilingual topic model that uses stochastic EM to find language matches and shared topics in unaligned multilingual texts.

Findings

01

MuTo effectively finds shared topics in real-world multilingual corpora.

02

MuTo successfully pairs related documents across languages.

03

The model operates without the need for parallel corpora.

Abstract

We develop the multilingual topic model for unaligned text (MuTo), a probabilistic model of text that is designed to analyze corpora composed of documents in two languages. From these documents, MuTo uses stochastic EM to simultaneously discover both a matching between the languages and multilingual latent topics. We demonstrate that MuTo is able to find shared topics on real-world multilingual corpora, successfully pairing related documents across languages. MuTo provides a new framework for creating multilingual topic models without needing carefully curated parallel corpora and allows applications built using the topic model formalism to be applied to a much wider class of corpora.
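To make the role of the matching concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a fixed, hand-written one-to-one matching between an English and a German vocabulary (MuTo infers its matching with stochastic EM), and it compares documents by counts of matched word pairs instead of inferred topic proportions. The names MATCHING, to_shared_space, cosine, and pair_documents are hypothetical and exist only for this illustration.

```python
from collections import Counter
from math import sqrt

# Hypothetical one-to-one matching between two vocabularies.
# Assumption: fixed by hand here; MuTo infers the matching from data.
MATCHING = {"dog": "hund", "house": "haus", "water": "wasser", "bread": "brot"}


def to_shared_space(tokens, language):
    """Count matched-pair ids in a tokenized document; unmatched words are dropped."""
    if language == "en":
        lookup = {en: i for i, en in enumerate(MATCHING)}
    else:
        lookup = {de: i for i, de in enumerate(MATCHING.values())}
    return Counter(lookup[t] for t in tokens if t in lookup)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (0.0 if either is empty)."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def pair_documents(en_docs, de_docs):
    """For each English document, return the index of the most similar German document."""
    de_vecs = [to_shared_space(d, "de") for d in de_docs]
    pairs = []
    for doc in en_docs:
        vec = to_shared_space(doc, "en")
        scores = [cosine(vec, dv) for dv in de_vecs]
        pairs.append(max(range(len(scores)), key=scores.__getitem__))
    return pairs


# Toy unaligned corpora: document order differs between the two languages.
en_docs = [["dog", "dog", "house"], ["water", "bread"]]
de_docs = [["wasser", "brot", "brot"], ["hund", "haus"]]
print(pair_documents(en_docs, de_docs))
```

Once a matching exists, documents in both languages can be expressed over the same set of matched pairs and compared directly, even though the corpora are not aligned. A topic model would replace the raw pair counts with per-document topic proportions.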

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques