Bilingual Topic Models for Comparable Corpora
Georgios Balikas, Massih-Reza Amini, Marianne Clausel

TL;DR
This paper introduces a flexible bilingual topic model in which paired documents have separate but related topic distributions. Cross-lingual embeddings are used to estimate how similar each pair is, which improves the modeling of comparable corpora whose pairs vary in thematic similarity.
Contribution
It proposes a novel binding mechanism for bilingual topic models, whose strength depends on the semantic similarity of each document pair, extending existing models to better handle comparable corpora.
Findings
Improved topic coherence measured by normalized point-wise mutual information.
Enhanced cross-lingual document retrieval performance.
Better generalization demonstrated by perplexity metrics.
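For readers unfamiliar with the coherence metric named above, the following is a minimal sketch of normalized point-wise mutual information computed over a topic's top words. It assumes probabilities are estimated from document-level co-occurrence in a reference corpus; the function name `npmi_coherence` and the toy corpus are hypothetical, and the paper's exact evaluation setup may differ.

```python
import math
from itertools import combinations


def npmi_coherence(top_words, documents):
    """Average NPMI over all pairs of a topic's top words.

    Assumption: p(w) and p(w1, w2) are estimated as the fraction of
    reference documents containing the word(s).
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*words):
        # Fraction of documents that contain every word in `words`.
        return sum(all(w in d for w in words) for d in doc_sets) / n_docs

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p1, p2, p12 = prob(w1), prob(w2), prob(w1, w2)
        if p12 == 0:
            scores.append(-1.0)  # convention: the words never co-occur
        elif p12 == 1.0:
            scores.append(1.0)   # words appear in every document; avoids 0/0
        else:
            pmi = math.log(p12 / (p1 * p2))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)


# Toy reference corpus (hypothetical data, for illustration only).
corpus = [["cat", "dog"], ["cat", "dog"], ["fish"], ["bird"]]
together = npmi_coherence(["cat", "dog"], corpus)
apart = npmi_coherence(["cat", "fish"], corpus)
```

Scores lie in [-1, 1]: higher values mean the top words of a topic tend to appear in the same documents.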
Abstract
Probabilistic topic models like Latent Dirichlet Allocation (LDA) have previously been extended to the bilingual setting. A fundamental modeling assumption in several of these extensions is that the input corpora consist of document pairs whose constituent documents share a single topic distribution. However, this assumption is too strong for comparable corpora, whose paired documents are thematically similar only to some extent and which are, in turn, the most commonly available and easiest to obtain. In this paper we relax this assumption by proposing that the paired documents have separate, yet bound, topic distributions. We suggest that the strength of the bound should depend on each pair's semantic similarity. To estimate the similarity of documents that are written in different languages we use cross-lingual word…
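One way to picture the per-pair similarity signal is sketched below. It assumes both languages are mapped into a shared embedding space, represents each document by the average of its word vectors, and scores a pair by cosine similarity. This is an illustrative choice, not necessarily the estimator used in the paper; the helper names and the toy two-dimensional vectors are hypothetical.

```python
import math


def doc_vector(tokens, embeddings):
    """Average the vectors of the tokens that have an embedding."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def pair_similarity(tokens_a, emb_a, tokens_b, emb_b):
    """Similarity of a document pair written in two different languages."""
    va, vb = doc_vector(tokens_a, emb_a), doc_vector(tokens_b, emb_b)
    if va is None or vb is None:
        return 0.0  # no usable words in at least one document
    return cosine(va, vb)


# Hypothetical toy embeddings living in one shared space.
EN = {"dog": [1.0, 0.0], "music": [0.0, 1.0]}
FR = {"chien": [1.0, 0.0], "musique": [0.0, 1.0]}

same_theme = pair_similarity(["dog"], EN, ["chien"], FR)
partly_related = pair_similarity(["dog", "music"], EN, ["chien"], FR)
unrelated = pair_similarity(["dog"], EN, ["musique"], FR)
```

A score of this kind could then set how tightly the two topic distributions of a pair are tied: closely related pairs are bound strongly, loosely related pairs only weakly.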
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques
Methods: Latent Dirichlet Allocation
