Exploiting a comparability mapping to improve bi-lingual data categorization: a three-mode data analysis perspective
Pierre-François Marteau (IRISA), Guiyao Ke (IRISA)

TL;DR
This paper introduces a three-mode data analysis approach that combines similarity and comparability measures to enhance bilingual data clustering and classification accuracy, demonstrated through synthetic and real Wikipedia data.
Contribution
A novel three-mode analysis scheme that integrates comparability measures with similarity measures to improve bilingual clustering and classification tasks.
Findings
Improved clustering and classification accuracy with combined measures.
Higher robustness with proposed comparability variants.
Effective for constructing thematically comparable bilingual corpora.
Abstract
In this paper we address the co-clustering and co-classification of bilingual data lying in two linguistic similarity spaces when a comparability measure defining a mapping between these two spaces is available. A new approach, which can be characterized as a three-mode analysis scheme, is proposed to mix the comparability measure with the two similarity measures. Our aim is to jointly improve the accuracy of the classification and clustering tasks performed in each of the two linguistic spaces, as well as the quality of the final alignment of comparable clusters that can be obtained. We first used purely synthetic random data sets to assess our formal similarity-comparability mixing model. We then propose two variants of the comparability measure defined by Li and Gaussier (2010) in the context of bilingual lexicon extraction, to adapt it to clustering or categorizing…
Taxonomy
Topics: Natural Language Processing Techniques · Text and Document Classification Technologies · Topic Modeling
