Revisiting Cross-Lingual Summarization: A Corpus-based Study and A New Benchmark with Improved Annotation

Yulong Chen; Huajian Zhang; Yijie Zhou; Xuefeng Bai; Yueguan Wang; Ming Zhong; Jianhao Yan; Yafu Li; Judy Li; Michael Zhu; Yue Zhang

arXiv:2307.04018·cs.CL·July 11, 2023

TL;DR

This paper introduces ConvSumX, a new cross-lingual summarization benchmark with improved annotation considering source context, and proposes a 2-Step method that outperforms existing baselines by leveraging both conversation and summary inputs.

Contribution

The paper presents ConvSumX, a novel benchmark for cross-lingual summarization with explicit source context annotation, and a 2-Step method that enhances summarization quality by mimicking human annotation.

Findings

01

ConvSumX is more faithful to input text than existing corpora.

02

The 2-Step method outperforms strong baselines in automatic and human evaluations.

03

Both source input and summary are vital for effective cross-lingual summarization.
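The pipeline behind findings 02 and 03 can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the listing only states that the 2-Step method takes both the conversation and the summary as input. The function names, the `[SEP]` separator, and the summary-first ordering are all hypothetical, and the two model calls are passed in as plain callables.

```python
from typing import Callable

# Hypothetical separator; a real system would use whatever delimiter its model expects.
SEP = " [SEP] "


def build_second_step_input(conversation: str, summary: str, sep: str = SEP) -> str:
    """Concatenate the source-language summary with the source conversation.

    Putting the summary first is an assumption made for illustration only.
    """
    return f"{summary.strip()}{sep}{conversation.strip()}"


def two_step_summarize(
    conversation: str,
    summarize: Callable[[str], str],
    translate_with_context: Callable[[str], str],
) -> str:
    """Step 1: summarize in the source language.
    Step 2: produce the target-language summary from summary + conversation."""
    summary = summarize(conversation)
    combined = build_second_step_input(conversation, summary)
    return translate_with_context(combined)
```

Any monolingual summarizer and any context-aware translation model could be plugged in as `summarize` and `translate_with_context`. The point of the sketch is only that the second step sees the original conversation as well as the summary, rather than the summary alone.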

Abstract

Most existing cross-lingual summarization (CLS) work constructs CLS corpora by simply and directly translating pre-annotated summaries from one language to another, which can contain errors from both the summarization and translation processes. To address this issue, we propose ConvSumX, a cross-lingual conversation summarization benchmark, through a new annotation schema that explicitly considers source input context. ConvSumX consists of 2 sub-tasks under different real-world scenarios, with each covering 3 language directions. We conduct a thorough analysis of ConvSumX and 3 widely-used manually annotated CLS corpora and empirically find that ConvSumX is more faithful towards input text. Additionally, based on the same intuition, we propose a 2-Step method, which takes both conversation and summary as input to simulate the human annotation process. Experimental results show that 2-Step method…

Code & Models

Repositories

cylnlp/convsumx
PyTorch · Official

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Text Readability and Simplification