Bidirectional Chinese and English Passive Sentences Dataset for Machine Translation
Xinyue Ma, Pol Pastells, Mireia Farrús, Mariona Taulé

TL;DR
This paper introduces a large, annotated bilingual dataset of passive sentences for English-Chinese translation, enabling better evaluation of machine translation systems' handling of passive constructions.
Contribution
It provides a novel, multi-domain dataset of passive sentences with automatic and manual annotations, specifically designed for evaluating and improving MT systems on linguistic phenomena.
Findings
Models tend to preserve source voice rather than adapt to target language norms.
The low frequency of Chinese passives and their association with negative contexts influence translation quality.
LLMs generate more diverse translations than traditional MT models.
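One way to make the first finding concrete is to count how often a system output keeps the voice of its source sentence. The sketch below is only an illustration under stated assumptions: the function name `voice_preservation_rate`, the `"passive"`/`"active"` label strings, and the toy pairs are hypothetical and are not taken from the paper or its dataset.

```python
def voice_preservation_rate(pairs):
    """Return the fraction of (source_voice, output_voice) pairs whose labels match.

    The labels are assumed to be plain strings such as "passive" or "active";
    this labelling scheme is hypothetical, not the paper's annotation format.
    """
    pairs = list(pairs)
    if not pairs:
        raise ValueError("no labelled pairs given")
    kept = sum(1 for source_voice, output_voice in pairs if source_voice == output_voice)
    return kept / len(pairs)


# Toy, made-up labels purely for illustration (not real results).
toy_pairs = [
    ("passive", "passive"),
    ("passive", "active"),
    ("passive", "passive"),
    ("active", "active"),
]
rate = voice_preservation_rate(toy_pairs)
print(f"voice preserved in {rate:.0%} of toy pairs")
```

Comparing this rate for system output against the same rate for the human reference translations would show whether a model follows the source voice more closely than human translators do.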
Abstract
Machine Translation (MT) evaluation has gone beyond metrics, towards more specific linguistic phenomena. In the English-Chinese language pair, passive sentences are constructed and distributed differently in the two languages, and thus need special attention in MT. This paper proposes a bidirectional multi-domain dataset of passive sentences, extracted from five Chinese-English parallel corpora and annotated automatically with structure labels according to human translation, and a test set with manually verified annotation. The dataset consists of 73,965 parallel sentence pairs (2,358,731 English words, 3,498,229 Chinese characters). We evaluate two state-of-the-art open-source MT systems with our dataset, and four commercial models with the test set. The results show that, unlike humans, models are more influenced by the voice of the source text rather than the general voice usage…
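The size figures in the abstract imply rough average lengths per sentence pair. The short calculation below uses only the three totals stated above; the variable names are illustrative, and the averages are derived here rather than reported by the authors.

```python
# Totals as stated in the abstract.
sentence_pairs = 73_965
english_words = 2_358_731
chinese_characters = 3_498_229

# Derived averages (not reported in the abstract itself).
avg_english_words = english_words / sentence_pairs
avg_chinese_characters = chinese_characters / sentence_pairs

print(f"English words per pair: {avg_english_words:.1f}")
print(f"Chinese characters per pair: {avg_chinese_characters:.1f}")
```

This works out to roughly 32 English words and 47 Chinese characters per sentence pair.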
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Authorship Attribution and Profiling
