Automatic Identification of Document Translations in Large Multilingual Document Collections

Bruno Pouliquen; Ralf Steinberger; Camelia Ignat

arXiv:cs/0609060·cs.CL·May 23, 2007·53 cites

Open Access

TL;DR

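This paper presents a working system that identifies translations and other very similar documents in large multilingual collections. It represents each document as a vector of terms from a multilingual thesaurus and measures the semantic similarity between vectors. In tests on different text types, it detected translations with over 96% precision in a search space of 820 documents or more.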

Contribution

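A working system that detects document translations across languages by comparing vectors of multilingual thesaurus terms. It is tuned to ignore language-specific similarities, so that a similar document in a second language receives the same similarity score as an equivalent document in the same language.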

Findings

01 Translations were detected with over 96% precision in tests on different text types.

02 That precision was measured in a large search space of 820 documents or more.

03 The same approach can be used to detect cross-lingual document plagiarism.

Abstract

Texts and their translations are a rich linguistic resource that can be used to train and test statistics-based Machine Translation systems and many other applications. In this paper, we present a working system that can identify translations and other very similar documents among a large number of candidates, by representing the document contents with a vector of thesaurus terms from a multilingual thesaurus, and by then measuring the semantic similarity between the vectors. Tests on different text types have shown that the system can detect translations with over 96% precision in a large search space of 820 documents or more. The system was tuned to ignore language-specific similarities and to give similar documents in a second language the same similarity score as equivalent documents in the same language. The application can also be used to detect cross-lingual document plagiarism.
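The comparison step described in the abstract can be illustrated with a minimal sketch. It assumes that each document has already been mapped to a weighted vector of language-independent thesaurus descriptors, and it uses cosine similarity as the measure; the abstract does not name the similarity measure, so cosine is an assumption here. The descriptor IDs, weights, and document names are hypothetical.

```python
import math
from typing import Dict, Tuple

# A document is represented as {thesaurus descriptor ID: weight}.
# Descriptor IDs are language-independent, so vectors built from texts
# in different languages can be compared directly.
Vector = Dict[str, float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two sparse descriptor vectors."""
    dot = sum(weight * b[key] for key, weight in a.items() if key in b)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is similar to nothing
    return dot / (norm_a * norm_b)


def best_match(query: Vector, candidates: Dict[str, Vector]) -> Tuple[str, float]:
    """Return the candidate document most similar to the query, with its score."""
    scored = ((doc_id, cosine(query, vec)) for doc_id, vec in candidates.items())
    return max(scored, key=lambda pair: pair[1])


# Hypothetical data: one English document and two French candidates.
english_doc: Vector = {"D1": 0.9, "D2": 0.4, "D3": 0.1}
french_candidates: Dict[str, Vector] = {
    "fr_a": {"D1": 0.8, "D2": 0.5, "D3": 0.1},  # shares all descriptors
    "fr_b": {"D4": 0.7, "D5": 0.6},             # shares none
}

match_id, match_score = best_match(english_doc, french_candidates)
print(match_id, round(match_score, 3))
```

In this sketch the candidate with the highest score is proposed as the translation. A real system would also need a threshold below which no candidate is accepted, since not every document has a translation in the collection.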

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Algorithms and Data Compression