Automatic Machine Translation Detection Using a Surrogate Multilingual Translation Model
Cristian García-Romero, Miquel Esplà-Gomis, Felipe Sánchez-Martínez

TL;DR
This paper introduces a method that uses the internal representations of a surrogate multilingual MT model to identify machine-translated text, which supports filtering parallel training data, especially for non-English language pairs.
Contribution
The paper presents a new approach that uses the internal representations of a surrogate multilingual MT model to distinguish human from machine translations, outperforming current state-of-the-art methods.
Findings
Improves accuracy by at least 5 percentage points over current state-of-the-art methods.
Particularly effective for non-English language pairs.
Demonstrates the importance of internal model representations in translation detection.
Abstract
Modern machine translation (MT) systems depend on large parallel corpora, often collected from the Internet. However, recent evidence indicates that (i) a substantial portion of these texts are machine-generated translations, and (ii) an overreliance on such synthetic content in training data can significantly degrade translation quality. As a result, filtering out non-human translations is becoming an essential pre-processing step in building high-quality MT systems. In this work, we propose a novel approach that directly exploits the internal representations of a surrogate multilingual MT model to distinguish between human and machine-translated sentences. Experimental results show that our method outperforms current state-of-the-art techniques, particularly for non-English language pairs, achieving gains of at least 5 percentage points of accuracy.
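The abstract does not specify how the internal representations are extracted or classified, so the following is only a minimal sketch of the general idea under stated assumptions: per-token hidden states are mean-pooled into one sentence vector, and a logistic-regression probe is trained to separate human from machine translations. The hidden states here are synthetic placeholders generated by the hypothetical helper `fake_hidden_states`; in practice they would come from the surrogate multilingual MT model. The pooling strategy, the classifier, and all names are illustrative assumptions, not the authors' method.

```python
import math
import random


def mean_pool(token_states):
    """Average per-token hidden vectors into a single sentence vector."""
    dim = len(token_states[0])
    n = len(token_states)
    return [sum(tok[i] for tok in token_states) / n for i in range(dim)]


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(features, labels, epochs=50, lr=0.1):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    w = [0.0] * len(features[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(features, labels):
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            grad = p - y  # gradient of the log loss w.r.t. the logit
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b


def predict(w, b, x):
    """Return 1 (machine-translated) or 0 (human-translated)."""
    return 1 if sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) >= 0.5 else 0


def fake_hidden_states(rng, is_machine, n_tokens=8, dim=4):
    """ASSUMPTION: synthetic stand-in for hidden states of a surrogate MT model.

    Gaussian noise with a class-dependent mean, used only so the sketch runs
    without downloading a model. Real representations would replace this.
    """
    shift = 1.0 if is_machine else -1.0
    return [[rng.gauss(shift, 1.0) for _ in range(dim)] for _ in range(n_tokens)]


rng = random.Random(0)
labels = [i % 2 for i in range(200)]  # 1 = machine, 0 = human
features = [mean_pool(fake_hidden_states(rng, y == 1)) for y in labels]

# Train on the first 150 sentences, evaluate on the remaining 50.
w, b = train_probe(features[:150], labels[:150])
held_out = list(zip(features[150:], labels[150:]))
accuracy = sum(predict(w, b, x) == y for x, y in held_out) / len(held_out)
print(f"held-out accuracy on synthetic data: {accuracy:.2f}")
```

Because the data is synthetic and deliberately easy to separate, the printed accuracy says nothing about the results reported in the paper; the sketch only illustrates the pipeline shape of representations, pooling, and a binary classifier.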
Taxonomy
Topics: Natural Language Processing Techniques · Translation Studies and Practices · Topic Modeling
