Using CollGram to Compare Formulaic Language in Human and Neural Machine Translation
Yves Bestgen

TL;DR
This paper compares formulaic sequences in human and neural machine translations of quality newspaper articles. Neural translations contain fewer lower-frequency but strongly associated sequences and more high-frequency ones, and the neural systems also differ from one another.
Contribution
It applies CollGram to compare formulaic sequences in human and neural machine translations, documenting systematic differences between the two and variation across translation systems.
Findings
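Neural machine translations contain fewer lower-frequency but strongly associated formulaic sequences than human translations.
Neural machine translations contain more high-frequency formulaic sequences than human translations.
The differences are statistically significant, and the effect sizes are almost always medium or large.
Some neural machine translation systems produce more formulaic sequences of both types than others.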
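To make the "medium or large" wording concrete, here is a minimal sketch of one common effect-size measure, Cohen's d, with its conventional 0.2 / 0.5 / 0.8 cut-offs. This summary does not say which effect-size statistic the paper used, so the measure, the function names `cohens_d` and `label`, and the toy scores are all assumptions for illustration, not values from the paper.

```python
import math
import statistics

def cohens_d(a, b):
    """Cohen's d for two independent samples, using the pooled standard deviation."""
    na, nb = len(a), len(b)
    va, vb = statistics.variance(a), statistics.variance(b)  # sample variances
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (statistics.mean(a) - statistics.mean(b)) / pooled_sd

def label(d):
    """Map |d| to Cohen's conventional verbal labels."""
    size = abs(d)
    if size >= 0.8:
        return "large"
    if size >= 0.5:
        return "medium"
    if size >= 0.2:
        return "small"
    return "negligible"

# Made-up per-text scores, NOT data from the paper.
human_scores = [2, 4, 6]
machine_scores = [1, 3, 5]

d = cohens_d(human_scores, machine_scores)
print(d, label(d))
```

If the human and machine translations are of the same source texts, a paired effect size may be more appropriate than this independent-samples version.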
Abstract
A comparison of formulaic sequences in human and neural machine translation of quality newspaper articles shows that neural machine translations contain fewer lower-frequency but strongly associated formulaic sequences, and more high-frequency formulaic sequences. These differences were statistically significant and the effect sizes were almost always medium or large. These observations can be related to the differences between second language learners of various levels and between translated and untranslated texts. The comparison between the neural machine translation systems indicates that some systems produce more formulaic sequences of both types than other systems.
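As a rough illustration of the two properties contrasted in the abstract, frequency and association strength, the sketch below scores bigrams against a reference corpus with two measures that are standard in collocation research: mutual information (MI), which favours rarer but tightly bound pairs, and the t-score, which favours very frequent pairs. This summary does not spell out CollGram's exact procedure, so the function name `bigram_scores`, the toy corpus, and the choice of formulas are assumptions made for illustration only.

```python
import math
from collections import Counter

def bigram_scores(tokens):
    """Return {bigram: (frequency, MI, t_score)} computed from a token list."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), observed in bigrams.items():
        # Count expected if the two words co-occurred only by chance.
        expected = unigrams[w1] * unigrams[w2] / n
        mi = math.log2(observed / expected)
        t_score = (observed - expected) / math.sqrt(observed)
        scores[(w1, w2)] = (observed, mi, t_score)
    return scores

# Toy reference corpus (made up): two very frequent pairs and one rare,
# tightly bound pair. A real study would use a large reference corpus.
reference = ("of the " * 50 + "in the " * 30 + "ad hoc " * 3).split()
scores = bigram_scores(reference)

for pair in [("of", "the"), ("ad", "hoc")]:
    freq, mi, t_score = scores[pair]
    print(pair, freq, round(mi, 2), round(t_score, 2))
```

In this toy corpus the rare pair "ad hoc" gets the higher MI while the frequent pair "of the" gets the higher t-score, which mirrors the distinction the abstract draws between lower-frequency, strongly associated sequences and high-frequency ones.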
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
