An Empirical Study of Many-to-Many Summarization with Large Language Models
Jiaan Wang, Fandong Meng, Zengkui Sun, Yunlong Liang, Yuxuan Cao, Jiarong Xu, Haoxiang Shi, Jie Zhou

TL;DR
This study systematically evaluates large language models' ability to perform many-to-many multilingual summarization, showing that instruction tuning enhances their performance and highlighting ongoing challenges with factual accuracy.
Contribution
It provides a comprehensive benchmark and analysis of LLMs' M2MS capabilities across multiple languages and domains, including the impact of instruction tuning.
Findings
Zero-shot LLMs perform competitively with traditional models.
Instruction tuning significantly improves LLMs' M2MS performance.
Factuality issues persist and may be worsened by instruction tuning.
Abstract
Many-to-many summarization (M2MS) aims to process documents in any language and generate the corresponding summaries, also in any language. Recently, large language models (LLMs) have shown strong multilingual abilities, giving them the potential to perform M2MS in real applications. This work presents a systematic empirical study of LLMs' M2MS ability. Specifically, we first reorganize M2MS data based on eight previous domain-specific datasets. The reorganized data contains 47.8K samples spanning five domains and six languages, which can be used to train and evaluate LLMs. Then, we benchmark 18 LLMs in a zero-shot manner and an instruction-tuning manner. Fine-tuned traditional models (e.g., mBART) are also evaluated for comparison. Our experiments reveal that zero-shot LLMs achieve results competitive with fine-tuned traditional models. After instruction tuning, open-source LLMs can…
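To make the task setup concrete, here is a minimal sketch of how the M2MS directions for six languages could be enumerated and turned into zero-shot prompts. The language codes are placeholders (the abstract does not name the six languages here), and the prompt template is a hypothetical illustration, not the wording used in the paper.

```python
from itertools import product

# Placeholder codes: the abstract mentions six languages but does not list them here.
LANGUAGES = ["lang_a", "lang_b", "lang_c", "lang_d", "lang_e", "lang_f"]


def m2ms_directions(languages):
    """Return every (source, target) pair, including same-language pairs."""
    return list(product(languages, repeat=2))


def build_prompt(document, source_lang, target_lang):
    """Build a zero-shot prompt. The template is an assumption, not the paper's."""
    return (
        f"Summarize the following {source_lang} document in {target_lang}.\n\n"
        f"Document: {document}\n\nSummary:"
    )


directions = m2ms_directions(LANGUAGES)
print(len(directions))  # 6 source languages x 6 target languages = 36
```

Under this reading of "any language to any language", six languages yield 36 directions: 6 monolingual and 30 cross-lingual.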
Taxonomy
Topics: Topic Modeling · Sentiment Analysis and Opinion Mining · Text and Document Classification Technologies
