An Empirical Study of Many-to-Many Summarization with Large Language Models

Jiaan Wang; Fandong Meng; Zengkui Sun; Yunlong Liang; Yuxuan Cao; Jiarong Xu; Haoxiang Shi; Jie Zhou

arXiv:2505.12983 · cs.CL · May 20, 2025

TL;DR

This study systematically evaluates large language models' ability to perform many-to-many multilingual summarization, showing that instruction tuning enhances their performance and highlighting ongoing challenges with factual accuracy.

Contribution

It provides a benchmark and analysis of LLMs' many-to-many summarization (M2MS) capabilities across six languages and five domains, including the impact of instruction tuning.

Findings

01

Zero-shot LLMs perform competitively with fine-tuned traditional models (e.g., mBART).

02

Instruction tuning significantly improves LLMs' M2MS performance.

03

Factuality issues persist and may be worsened by instruction tuning.

Abstract

Many-to-many summarization (M2MS) aims to process documents in any language and generate corresponding summaries, also in any language. Recently, large language models (LLMs) have shown strong multilingual abilities, giving them the potential to perform M2MS in real applications. This work presents a systematic empirical study of LLMs' M2MS ability. Specifically, we first reorganize M2MS data from eight previous domain-specific datasets. The reorganized data contains 47.8K samples spanning five domains and six languages, and can be used to train and evaluate LLMs. We then benchmark 18 LLMs in both zero-shot and instruction-tuned settings. Fine-tuned traditional models (e.g., mBART) are also evaluated for comparison. Our experiments reveal that zero-shot LLMs achieve results competitive with fine-tuned traditional models. After instruction tuning, open-source LLMs can…
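As a rough illustration of what "any language to any language" implies for a six-language setup, the sketch below enumerates the source/target directions and builds a zero-shot prompt for one of them. The language codes are placeholders (the abstract does not name the six languages), the prompt template is hypothetical and not taken from the paper, and same-language pairs are assumed to count as valid directions.

```python
from itertools import product

# Placeholder codes: the abstract states six languages but does not list them.
LANGUAGES = ["L1", "L2", "L3", "L4", "L5", "L6"]


def m2ms_directions(languages):
    """Return every (source, target) pair, including same-language pairs."""
    return list(product(languages, repeat=2))


def build_prompt(document, src, tgt):
    """Hypothetical zero-shot prompt template (an assumption, not the paper's)."""
    return (
        f"Summarize the following {src} document in {tgt}.\n\n"
        f"Document: {document}\n\nSummary:"
    )


directions = m2ms_directions(LANGUAGES)
monolingual = [(s, t) for s, t in directions if s == t]
cross_lingual = [(s, t) for s, t in directions if s != t]

# 6 x 6 = 36 directions: 6 monolingual and 30 cross-lingual.
print(len(directions), len(monolingual), len(cross_lingual))
```

Under these assumptions, a single M2MS model is expected to cover all 36 directions rather than one model per language pair.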

Taxonomy

Topics: Topic Modeling · Sentiment Analysis and Opinion Mining · Text and Document Classification Technologies