Understanding Translationese in Cross-Lingual Summarization

Jiaan Wang; Fandong Meng; Yunlong Liang; Tingyi Zhang; Jiarong Xu; Zhixu Li; Jie Zhou

arXiv:2212.07220·cs.CL·October 11, 2023

Open Access

TL;DR

This paper investigates how translationese affects cross-lingual summarization, revealing its impact on evaluation, model performance, and dataset construction, and offers guidelines for future research in the field.

Contribution

The paper systematically analyzes how translationese influences CLS datasets, evaluation, and models, and provides insights and recommendations for future dataset and model development.

Findings

01. Translationese in test sets causes evaluation discrepancies.

02. Training on translationese can harm real-world model performance.

03. Machine-translated documents are useful for low-resource CLS systems.

Abstract

Given a document in a source language, cross-lingual summarization (CLS) aims at generating a concise summary in a different target language. Unlike in monolingual summarization (MS), naturally occurring source-language documents paired with target-language summaries are rare. To collect large-scale CLS data, existing datasets typically involve translation in their creation. However, translated text, known as translationese, differs from text originally written in that language. In this paper, we first confirm that different approaches to constructing CLS datasets lead to different degrees of translationese. Then we systematically investigate how translationese affects CLS model evaluation and performance when it appears in source documents or target summaries. In detail, we find that (1) the translationese in documents or summaries of test sets might lead to the…

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Text and Document Classification Technologies

Methods: Test