Bilingual Topic Models for Comparable Corpora

Georgios Balikas; Massih-Reza Amini; Marianne Clausel

arXiv:2111.15278·cs.CL·December 1, 2021

Open Access

TL;DR

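This paper introduces a bilingual topic model in which paired documents have separate but bound topic distributions, with the strength of the binding depending on each pair's semantic similarity as estimated from cross-lingual word embeddings. This relaxes the shared-distribution assumption for comparable corpora whose paired documents are only partly similar in theme.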

Contribution

It proposes a novel binding mechanism for bilingual topic models that depends on semantic similarity, extending existing models to better handle comparable corpora.

Findings

01 Improved topic coherence, measured by normalized pointwise mutual information (NPMI).

02 Enhanced cross-lingual document retrieval performance.

03 Better generalization, as measured by perplexity.
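For readers unfamiliar with the coherence measure named above, here is a minimal sketch of NPMI computed from document-level co-occurrence. The function name `npmi_coherence`, the toy corpus, and the handling of edge cases are assumptions made for illustration; the paper's exact evaluation setup (reference corpus, number of top words, smoothing) is not given on this page.

```python
import math
from itertools import combinations

def npmi_coherence(top_words, documents):
    """Average NPMI over all pairs of a topic's top words.

    Probabilities are estimated as the fraction of documents that
    contain a word (or both words of a pair).
    """
    if len(top_words) < 2:
        raise ValueError("need at least two words to form a pair")
    docs = [set(d) for d in documents]
    n = len(docs)

    def prob(*words):
        return sum(all(w in d for w in words) for d in docs) / n

    scores = []
    for w1, w2 in combinations(top_words, 2):
        p12 = prob(w1, w2)
        if p12 == 0:
            # Words never co-occur: NPMI tends to -1 in the limit.
            scores.append(-1.0)
        elif p12 == 1:
            # Both words occur in every document: the ratio is 0/0.
            # Using 0.0 here is an assumption, not a standard rule.
            scores.append(0.0)
        else:
            pmi = math.log(p12 / (prob(w1) * prob(w2)))
            scores.append(pmi / -math.log(p12))
    return sum(scores) / len(scores)

# Hypothetical toy corpus: "a" and "b" always appear together, "c" never with them.
toy_docs = [["a", "b"], ["a", "b"], ["c"], ["c"]]
together = npmi_coherence(["a", "b"], toy_docs)
apart = npmi_coherence(["a", "c"], toy_docs)
```

Scores lie in [-1, 1]; higher values mean a topic's top words tend to appear in the same documents.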

Abstract

Probabilistic topic models like Latent Dirichlet Allocation (LDA) have been previously extended to the bilingual setting. A fundamental modeling assumption in several of these extensions is that the input corpora are in the form of document pairs whose constituent documents share a single topic distribution. However, this assumption is strong for comparable corpora, which consist of documents that are thematically similar only to an extent and which are, in turn, the most commonly available or easiest to obtain. In this paper we relax this assumption by proposing that the paired documents have separate, yet bound, topic distributions. We suggest that the strength of the bound should depend on each pair's semantic similarity. To estimate the similarity of documents that are written in different languages we use cross-lingual word…
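As a rough illustration of how a pair's semantic similarity could be estimated across languages, the sketch below averages each document's word vectors and compares the two averages with cosine similarity. This is an assumption for illustration only: the abstract is truncated before it describes the actual estimator, and the names `doc_vector`, `pair_similarity`, and the two-dimensional toy embeddings are hypothetical. It presumes both vocabularies are already mapped into one shared embedding space.

```python
import numpy as np

def doc_vector(tokens, embeddings):
    """Mean of the vectors of the tokens that have an embedding."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        raise ValueError("no token of the document has an embedding")
    return np.mean(vecs, axis=0)

def pair_similarity(doc_a, doc_b, emb_a, emb_b):
    """Cosine similarity between the averaged vectors of two documents."""
    a = doc_vector(doc_a, emb_a)
    b = doc_vector(doc_b, emb_b)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical toy embeddings in a shared English-French space.
emb_en = {"dog": np.array([1.0, 0.0]), "bank": np.array([0.0, 1.0])}
emb_fr = {"chien": np.array([0.9, 0.1]), "banque": np.array([0.1, 0.9])}

sim_close = pair_similarity(["dog"], ["chien"], emb_en, emb_fr)
sim_far = pair_similarity(["dog"], ["banque"], emb_en, emb_fr)
```

Under this sketch, a pair with a higher score would be bound more tightly than a pair with a lower one; how the score enters the model is not specified here.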

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Text Analysis Techniques

Methods: Latent Dirichlet Allocation