Decoding Machine Translationese in English-Chinese News: LLMs vs. NMTs
Delu Kong, Lieve Macken

TL;DR
This paper investigates the linguistic features of machine translation outputs in English-Chinese news texts, revealing distinct patterns and differences between neural machine translation systems and large language models.
Contribution
It introduces a large dataset and comprehensive feature analysis to identify and compare translationese in LLMs and NMTs for English-Chinese news translation.
Findings
MTese is detectable in both LLMs and NMTs.
Original Chinese texts are nearly perfectly distinguishable from machine outputs.
LLMs show greater lexical diversity than NMTs.
Abstract
This study explores Machine Translationese (MTese) -- the linguistic peculiarities of machine translation outputs -- focusing on the under-researched English-to-Chinese language pair in news texts. We construct a large dataset consisting of 4 sub-corpora and employ a comprehensive five-layer feature set. Then, a chi-square ranking algorithm is applied for feature selection in both classification and clustering tasks. Our findings confirm the presence of MTese in both Neural Machine Translation systems (NMTs) and Large Language Models (LLMs). Original Chinese texts are nearly perfectly distinguishable from both LLM and NMT outputs. Notable linguistic patterns in MT outputs are shorter sentence lengths and increased use of adversative conjunctions. Comparing LLMs and NMTs, we achieve approximately 70% classification accuracy, with LLMs exhibiting greater lexical diversity and NMTs using…
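The chi-square ranking step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the feature names, the toy values, and the helper functions `chi2_scores` and `rank_features` are hypothetical, and the statistic is the common count-based formulation that compares per-class feature totals with the totals expected if the feature were independent of the class.

```python
def chi2_scores(X, y):
    """Chi-square statistic per feature, for non-negative feature values.

    X is a list of rows (one per text), y the matching class labels.
    """
    n_samples = len(y)
    n_features = len(X[0])
    classes = sorted(set(y))

    # observed[c][j]: sum of feature j over all texts of class c
    observed = {c: [0.0] * n_features for c in classes}
    for row, label in zip(X, y):
        for j, value in enumerate(row):
            observed[label][j] += value

    feature_totals = [sum(observed[c][j] for c in classes) for j in range(n_features)]
    class_prob = {c: y.count(c) / n_samples for c in classes}

    scores = []
    for j in range(n_features):
        score = 0.0
        for c in classes:
            # Expected total if feature j were independent of the class
            expected = class_prob[c] * feature_totals[j]
            if expected > 0:
                score += (observed[c][j] - expected) ** 2 / expected
        scores.append(score)
    return scores


def rank_features(X, y, names, k):
    """Return the k (name, score) pairs with the highest chi-square scores."""
    ranked = sorted(zip(names, chi2_scores(X, y)), key=lambda p: p[1], reverse=True)
    return ranked[:k]


# Invented toy values, for illustration only (not data from the paper).
names = ["adversative_conj_count", "sentence_len_tokens", "punct_count"]
X = [
    [1, 30, 5],
    [0, 32, 5],
    [1, 31, 5],
    [4, 20, 5],
    [5, 21, 5],
    [4, 19, 5],
]
y = ["original", "original", "original", "mt", "mt", "mt"]

top = rank_features(X, y, names, k=2)
for name, score in top:
    print(name, round(score, 3))
```

A feature whose totals are identical across classes (here the hypothetical `punct_count`) scores zero and falls to the bottom of the ranking, which is why such a ranking is useful for keeping only the features that separate the sub-corpora.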
Taxonomy
Topics: Natural Language Processing Techniques · Translation Studies and Practices · Topic Modeling
