Building and Aligning Comparable Corpora
Motaz Saad, David Langlois, Kamel Smaili

TL;DR
This paper presents a method for building and aligning comparable multilingual corpora using Wikipedia and EURONEWS, and demonstrates that cross-lingual LSI outperforms dictionary-based measures in aligning documents at topic and event levels.
Contribution
It introduces a novel approach to automatically align comparable documents across languages using cross-lingual similarity measures, especially CL-LSI, which outperforms dictionary-based methods.
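As a rough illustration of how a CL-LSI similarity could be computed, the sketch below assumes the common cross-lingual LSI setup in which each training document is the concatenation of a paired English and French document, so that both languages share one latent space. This summary does not give the authors' exact procedure; the toy data, the function names (`train_cl_lsi`, `fold_in`, `cosine`), the choice of `k`, and the use of NumPy are all assumptions made for the example.

```python
import numpy as np

def build_vocab(docs):
    """Map every token seen in the training documents to a row index."""
    vocab = sorted({tok for d in docs for tok in d.split()})
    return {tok: i for i, tok in enumerate(vocab)}

def term_vector(doc, vocab):
    """Raw term-frequency vector; out-of-vocabulary tokens are ignored."""
    vec = np.zeros(len(vocab))
    for tok in doc.split():
        if tok in vocab:
            vec[vocab[tok]] += 1.0
    return vec

def train_cl_lsi(paired_docs, k):
    """Fit LSI on concatenated (English, French) pairs; keep the top-k dimensions."""
    merged = [en + " " + fr for en, fr in paired_docs]
    vocab = build_vocab(merged)
    X = np.column_stack([term_vector(d, vocab) for d in merged])  # terms x docs
    U, s, _ = np.linalg.svd(X, full_matrices=False)
    return vocab, U[:, :k], s[:k]

def fold_in(doc, vocab, U_k, s_k):
    """Project a monolingual document into the shared latent space."""
    return term_vector(doc, vocab) @ U_k / s_k

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

# Hypothetical toy training pairs covering two topics (animals, finance).
pairs = [
    ("cat dog", "chat chien"),
    ("cat cat dog", "chat chat chien"),
    ("bank money", "banque argent"),
    ("bank money money money", "banque argent argent argent"),
]
vocab, U_k, s_k = train_cl_lsi(pairs, k=2)

english_docs = ["dog cat dog", "money bank"]
french_docs = ["argent banque argent", "chien chat"]
en_vecs = [fold_in(d, vocab, U_k, s_k) for d in english_docs]
fr_vecs = [fold_in(d, vocab, U_k, s_k) for d in french_docs]

# Align each English document with its most similar French document.
best_match = [
    max(range(len(fr_vecs)), key=lambda j: cosine(e, fr_vecs[j]))
    for e in en_vecs
]
```

Because the latent space is learned from merged bilingual documents, an English and a French document can be compared directly without translating either one, which is the main practical difference from a dictionary-based measure.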
Findings
CL-LSI outperforms dictionary-based similarity measures.
The method successfully aligns documents at both topic and event levels.
Experiments on Wikipedia, EURONEWS, BBC, and ALJAZEERA data validate the approach.
Abstract
A comparable corpus is a set of topic-aligned documents in multiple languages that are not necessarily translations of each other. These documents are useful for multilingual natural language processing when no parallel text is available in some domains or languages. In addition, comparable documents are informative because they show what is being said about a topic in different languages. In this paper, we present a method to build comparable corpora from the Wikipedia encyclopedia and the EURONEWS website in English, French, and Arabic. We further experiment with a method to automatically align comparable documents using cross-lingual similarity measures. We investigate two such measures: the first is based on a bilingual dictionary, and the second is based on Latent Semantic Indexing (LSI). Experiments on several…
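A dictionary-based cross-lingual similarity can be sketched as follows: translate each source document word by word through a bilingual dictionary, then compare the translated bag of words with each target document using cosine similarity. The abstract does not specify the authors' exact formulation, so this is only one plausible reading; the tiny `FR_EN` dictionary, the sample documents, and the helper names are hypothetical.

```python
import math
from collections import Counter

# Hypothetical toy French-to-English dictionary; a real system would load
# a full bilingual lexicon.
FR_EN = {
    "chat": ["cat"],
    "chien": ["dog"],
    "banque": ["bank"],
    "argent": ["money", "silver"],
}

def translate_bag(tokens, dictionary):
    """Bag of target-language words; tokens missing from the dictionary are dropped."""
    bag = Counter()
    for tok in tokens:
        for target in dictionary.get(tok, []):
            bag[target] += 1
    return bag

def cosine(a, b):
    """Cosine similarity between two bags of words."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def align(source_docs, target_docs, dictionary):
    """For each source document, return (index of best target, similarity score)."""
    target_bags = [Counter(d.split()) for d in target_docs]
    pairs = []
    for doc in source_docs:
        bag = translate_bag(doc.split(), dictionary)
        scores = [cosine(bag, t) for t in target_bags]
        best = max(range(len(scores)), key=scores.__getitem__)
        pairs.append((best, scores[best]))
    return pairs

french = ["le chat et le chien", "la banque garde l argent"]
english = ["the bank keeps the money", "the cat and the dog"]
alignment = align(french, english, FR_EN)
```

In this toy run the first French document is matched to the second English document and vice versa. The measure depends entirely on dictionary coverage: words without an entry contribute nothing, and ambiguous entries such as `argent` add noise.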
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification
