TL;DR
WikiPDA is a novel crosslingual topic model that leverages Wikipedia links and Wikidata to produce language-independent topics, enabling applications like bias analysis and zero-shot language transfer.
Contribution
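The paper introduces WikiPDA, a two-step method that first densifies each article's bag of links via matrix completion and then trains a standard monolingual topic model on the result, so that articles in any language are represented over one shared set of topics.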
Findings
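In a human evaluation, WikiPDA yields more coherent topics than monolingual text-based LDA.
It enables crosslingual applications, demonstrated by a study of topical biases in 28 Wikipedia editions and by crosslingual supervised classification.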
The model supports zero-shot transfer to new languages.
Abstract
We present Wikipedia-based Polyglot Dirichlet Allocation (WikiPDA), a crosslingual topic model that learns to represent Wikipedia articles written in any language as distributions over a common set of language-independent topics. It leverages the fact that Wikipedia articles link to each other and are mapped to concepts in the Wikidata knowledge base, such that, when represented as bags of links, articles are inherently language-independent. WikiPDA works in two steps, by first densifying bags of links using matrix completion and then training a standard monolingual topic model. A human evaluation shows that WikiPDA produces more coherent topics than monolingual text-based LDA, thus offering crosslinguality at no cost. We demonstrate WikiPDA's utility in two applications: a study of topical biases in 28 Wikipedia editions, and crosslingual supervised classification. Finally, we…
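The claim that bags of links are inherently language-independent can be illustrated with a minimal sketch. It assumes a toy lookup table from (language, title) pairs to concept IDs; the table `TITLE_TO_CONCEPT`, the helper `bag_of_links`, and the IDs `C1` to `C3` are hypothetical placeholders for illustration, not real Wikidata identifiers or the authors' code. The densification and topic-modeling steps are not shown.

```python
from collections import Counter

# Hypothetical toy mapping from (language, article title) to a concept ID.
# The IDs are placeholders, not real Wikidata identifiers.
TITLE_TO_CONCEPT = {
    ("en", "Moon"): "C1",
    ("de", "Mond"): "C1",
    ("en", "Earth"): "C2",
    ("de", "Erde"): "C2",
    ("en", "Orbit"): "C3",
    ("de", "Umlaufbahn"): "C3",
}


def bag_of_links(lang, linked_titles):
    """Map an article's outgoing links to concept IDs and count them."""
    bag = Counter()
    for title in linked_titles:
        concept = TITLE_TO_CONCEPT.get((lang, title))
        if concept is not None:  # skip links with no known concept
            bag[concept] += 1
    return bag


# Two articles on the same subject, written in different languages,
# link to the same concepts and so produce the same bag of links.
english = bag_of_links("en", ["Moon", "Earth", "Orbit", "Earth"])
german = bag_of_links("de", ["Erde", "Mond", "Umlaufbahn", "Erde"])
print(english == german)  # prints True
```

Because both bags are keyed by concept rather than by title, a single topic model trained on such bags can, in principle, be applied to articles from any language edition.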
Taxonomy
Methods: Latent Dirichlet Allocation
