A Mixed-Language Multi-Document News Summarization Dataset and a Graphs-Based Extract-Generate Model

Shengxiang Gao; Fang nan; Yongbing Zhang; Yuxin Huang; Kaiwen Tan; Zhengtao Yu

arXiv:2410.09773·cs.CL·October 15, 2024

TL;DR

This paper introduces a new dataset for mixed-language multi-document news summarization and proposes a graph-based extract-generate model to improve summarization in multilingual, multi-document scenarios.

Contribution

The paper creates the first large-scale MLMD-news dataset and develops a novel graph-based model, advancing research in multilingual multi-document summarization.

Findings

01

Benchmarking various methods on MLMD-news dataset

02

Demonstrating effectiveness of the proposed graph-based model

03

Public release of dataset and code for research community

Abstract

Existing research on news summarization primarily focuses on single-language single-document (SLSD), single-language multi-document (SLMD) or cross-language single-document (CLSD). However, in real-world scenarios, news about an international event often involves multiple documents in different languages, i.e., mixed-language multi-document (MLMD). Therefore, summarizing MLMD news is of great significance. However, the lack of datasets for MLMD news summarization has constrained the development of research in this area. To fill this gap, we construct a mixed-language multi-document news summarization dataset (MLMD-news), which contains four different languages and 10,992 source document cluster and target summary pairs. Additionally, we propose a graph-based extract-generate model and benchmark various methods on the MLMD-news dataset and publicly release our dataset and…
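
To make the "source document cluster and target summary pair" idea concrete, here is a minimal Python sketch of how one such pair might be represented in memory. The released dataset's actual file format and field names are not described on this page, so `Document`, `ClusterSummaryPair`, `language`, and the placeholder language tags are hypothetical names chosen for illustration only.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Document:
    # Hypothetical fields: a language tag and the article text.
    language: str
    text: str


@dataclass
class ClusterSummaryPair:
    # One cluster of source documents about the same event, plus its target summary.
    documents: List[Document]
    summary: str

    def language_counts(self) -> Counter:
        # Count how many documents in the cluster use each language tag.
        return Counter(doc.language for doc in self.documents)

    def is_mixed_language(self) -> bool:
        # A cluster is "mixed-language" here if it contains more than one language tag.
        return len(self.language_counts()) > 1


# Placeholder content only; these are not records from MLMD-news.
pair = ClusterSummaryPair(
    documents=[
        Document(language="en", text="Placeholder English article."),
        Document(language="xx", text="Placeholder article in a second language."),
    ],
    summary="Placeholder target summary.",
)
```

Under these assumptions, `pair.is_mixed_language()` separates the MLMD setting from the single-language multi-document (SLMD) setting, where every document in a cluster shares one language tag.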

Code & Models

Repositories

southnf9/mlmd-news
PyTorch · Official

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Web Data Mining and Analysis