Automatic Identification of Document Translations in Large Multilingual Document Collections
Bruno Pouliquen, Ralf Steinberger, Camelia Ignat

TL;DR
This paper presents a working system that identifies translations and other very similar documents in large multilingual collections. It represents each document as a vector of terms from a multilingual thesaurus and compares the vectors, reaching over 96% precision across different text types.
Contribution
The system detects document translations across languages by comparing vectors of multilingual thesaurus terms, with over 96% precision in a search space of 820 documents or more.
Findings
Achieved over 96% precision in translation detection.
Tested in a large search space of 820 documents or more.
Can be used for cross-lingual plagiarism detection.
Abstract
Texts and their translations are a rich linguistic resource that can be used to train and test statistics-based Machine Translation systems and many other applications. In this paper, we present a working system that can identify translations and other very similar documents among a large number of candidates, by representing the document contents with a vector of thesaurus terms from a multilingual thesaurus, and by then measuring the semantic similarity between the vectors. Tests on different text types have shown that the system can detect translations with over 96% precision in a large search space of 820 documents or more. The system was tuned to ignore language-specific similarities and to give similar documents in a second language the same similarity score as equivalent documents in the same language. The application can also be used to detect cross-lingual document plagiarism.
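The matching step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the similarity measure, so cosine similarity is assumed here, and the descriptor identifiers (D001, ...), weights, document names, and the helper functions cosine and best_candidate are all hypothetical. Because thesaurus descriptors are language-independent, documents in different languages can be compared directly once each has been mapped to a descriptor vector.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two sparse vectors (dict: descriptor id -> weight).

    Assumption: the paper only says "semantic similarity between the vectors";
    cosine is used here as a common stand-in.
    """
    dot = sum(w * v[k] for k, w in u.items() if k in v)
    norm_u = sqrt(sum(w * w for w in u.values()))
    norm_v = sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def best_candidate(source_vec, candidates):
    """Return (doc_id, score) for the candidate most similar to source_vec."""
    scored = ((doc_id, cosine(source_vec, vec)) for doc_id, vec in candidates.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical descriptor vectors: one English source document and three
# French candidates, all expressed in the same thesaurus descriptor space.
source_en = {"D001": 0.8, "D017": 0.5, "D042": 0.2}
candidates_fr = {
    "fr_a": {"D001": 0.7, "D017": 0.6, "D042": 0.1},  # near-identical profile
    "fr_b": {"D003": 0.9, "D017": 0.2},               # partial overlap
    "fr_c": {"D100": 1.0},                            # no shared descriptors
}

best_id, best_score = best_candidate(source_en, candidates_fr)
print(best_id, round(best_score, 3))
```

In this toy data the candidate sharing the same descriptors with similar weights ranks first. A real system would also need the step that assigns thesaurus descriptors to each document, which is outside the scope of this sketch.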
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Algorithms and Data Compression
