On the Influence of Machine Translation on Language Origin Obfuscation
Benjamin Murauer, Michael Tschuggnall, Günther Specht

TL;DR
This paper investigates whether machine translation can obscure the original language of a text and shows that machine-learning classifiers can still detect the source language from translated output with high accuracy, especially for longer documents.
Contribution
The paper analyzes how effectively the source language can be detected from translated text and examines two factors that affect detection accuracy: document size and the number of candidate source languages.
Findings
High accuracy in source language detection from translated texts.
Document size significantly influences detection performance.
Limiting possible source languages improves classification accuracy.
Abstract
In the last decade, machine translation has become a popular means to deal with multilingual digital content. As translation quality improves, using machine translation to obfuscate the source language of a text becomes more attractive. In this paper, we analyze the ability to detect the source language from the translated output of two widely used commercial machine translation systems by utilizing machine-learning algorithms with basic textual features like n-grams. Evaluations show that the source language can be reconstructed with high accuracy for documents that contain a sufficient amount of translated text. In addition, we analyze how the document size influences the performance of the prediction, as well as how limiting the set of possible source languages improves the classification accuracy.
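The general idea of predicting a source language from n-gram features of translated text can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not name the classifier or the exact feature set, so the sketch assumes character trigrams and a nearest-profile rule based on cosine similarity. The function names, the `candidates` parameter, and the three toy sentences are hypothetical and were invented for illustration; they are not real machine translation output.

```python
from collections import Counter
import math


def char_ngrams(text, n=3):
    """Count overlapping character n-grams of a lowercased, whitespace-normalized text."""
    text = " ".join(text.lower().split())
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def cosine(a, b):
    """Cosine similarity between two n-gram count vectors (0.0 if either is empty)."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train_profiles(labeled_docs, n=3):
    """Build one aggregated n-gram profile per source-language label."""
    profiles = {}
    for source_lang, text in labeled_docs:
        profiles.setdefault(source_lang, Counter()).update(char_ngrams(text, n))
    return profiles


def predict_source(text, profiles, candidates=None, n=3):
    """Return the label whose profile is most similar to the text.

    `candidates` optionally restricts the set of source languages considered.
    """
    pool = list(candidates) if candidates is not None else list(profiles)
    query = char_ngrams(text, n)
    return max(pool, key=lambda lang: cosine(query, profiles[lang]))


# Invented English sentences standing in for translations from each source
# language (assumption: illustrative placeholders, not real MT output).
toy_docs = [
    ("de", "i have the book yesterday in the city bought"),
    ("fr", "it is necessary that we speak of the question of the budget"),
    ("es", "to me it pleases much the music of this group"),
]
profiles = train_profiles(toy_docs)

print(predict_source(toy_docs[0][1], profiles))
print(predict_source(toy_docs[1][1], profiles, candidates=["fr", "es"]))
```

Passing a smaller `candidates` list mirrors the restricted-language-set setting described in the abstract, since the classifier then has fewer labels to confuse. Very short inputs yield few n-grams, so their similarity scores carry little information, which is in line with the reported dependence on document size.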
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Digital Media Forensic Detection
