Multilingual De-Duplication Strategies: Applying scalable similarity search with monolingual & multilingual embedding models
Stefan Pasch, Dimitirios Petridis, Jannic Cutura

TL;DR
This paper compares two embedding-based methods for multilingual deduplication and shows that translating text to English before embedding outperforms direct multilingual embedding, especially for less widely used languages. Expert rules based on domain knowledge improve accuracy further.
Contribution
The paper presents a comparative analysis of multilingual deduplication strategies and shows that a translation-based method is more effective than direct multilingual embedding.
Findings
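The two-step method (translation to English, then embedding) achieves a higher F1 score than the multilingual embedding model (82% vs. 60%).
The multilingual embedding model performs worse on less widely used languages.
Expert rules based on domain knowledge can raise the F1 score of the two-step method to as much as 89%.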
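The F1 scores quoted here combine precision and recall into one number. As a minimal sketch of how such a score can be computed for deduplication, the function below assumes that predictions and ground truth are both given as unordered pairs of record identifiers. The source does not state the evaluation unit, so this pair-level setup and the name `pair_f1` are illustrative assumptions.

```python
def pair_f1(predicted, actual):
    """F1 score over unordered duplicate pairs (assumed evaluation unit)."""
    # frozenset makes (1, 2) and (2, 1) count as the same pair.
    pred = {frozenset(p) for p in predicted}
    gold = {frozenset(p) for p in actual}

    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0

    precision = true_positives / len(pred)  # share of predicted pairs that are real
    recall = true_positives / len(gold)     # share of real pairs that were found
    return 2 * precision * recall / (precision + recall)
```

For example, if one of two predicted pairs is correct and one of two true pairs is found, precision and recall are both 0.5, so F1 is 0.5.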
Abstract
This paper addresses the deduplication of multilingual textual data using advanced NLP tools. We compare a two-step method, which translates text to English and then embeds it with mpnet, against a multilingual embedding model (distiluse). The two-step approach achieved a higher F1 score (82% vs. 60%), particularly for less widely used languages, and this score can be increased to as much as 89% by leveraging expert rules based on domain knowledge. We also highlight limitations related to token length constraints and computational efficiency. Our methodology suggests improvements for future multilingual deduplication tasks.
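The two-step method described in the abstract can be sketched as translate, embed, then compare. The sketch below is illustrative only and is not the authors' implementation. The translation and embedding steps are passed in as functions, and the toy stand-ins (`toy_translate`, `toy_embed`, `TOY_DICT`, `VOCAB`) are hypothetical placeholders for a real translation system and an embedding model such as mpnet. The 0.9 cosine threshold is also an assumption.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicates(texts, translate, embed, threshold=0.9):
    """Return index pairs whose translated embeddings are near-identical."""
    english = [translate(t) for t in texts]   # step 1: translate to English
    vectors = [embed(t) for t in english]     # step 2: embed the English text
    # All-pairs comparison, for illustration only; it is quadratic in len(texts).
    return [
        (i, j)
        for i, j in combinations(range(len(texts)), 2)
        if cosine(vectors[i], vectors[j]) >= threshold
    ]


# Hypothetical stand-ins so the sketch runs without external models.
TOY_DICT = {"hallo": "hello", "welt": "world", "bonjour": "hello", "monde": "world"}
VOCAB = ["hello", "world", "goodbye"]


def toy_translate(text):
    """Word-by-word dictionary lookup standing in for machine translation."""
    return " ".join(TOY_DICT.get(w, w) for w in text.lower().split())


def toy_embed(text):
    """Bag-of-words counts standing in for a sentence embedding."""
    words = text.split()
    return [float(words.count(v)) for v in VOCAB]


texts = ["Hello world", "Hallo Welt", "Bonjour monde", "Goodbye world"]
pairs = find_duplicates(texts, toy_translate, toy_embed)
print(pairs)  # [(0, 1), (0, 2), (1, 2)]
```

At realistic data sizes the all-pairs loop would be replaced by a scalable similarity search such as an approximate nearest-neighbor index, which is what the paper's title refers to.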
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies
