Towards Multi-dimensional Evaluation of LLM Summarization across Domains and Languages

Hyangsuk Min; Yuho Lee; Minjeong Ban; Jiaqi Deng; Nicole Hee-Yeon Kim; Taewon Yun; Hang Su; Jason Cai; Hwanjun Song

arXiv:2506.00549 · cs.CL · June 3, 2025

TL;DR

MSumBench is a benchmark that evaluates summarization models across multiple domains in English and Chinese, using domain-specific assessment criteria and a multi-agent debate system to improve annotation quality.

Contribution

The paper introduces MSumBench, a multi-dimensional, multi-domain benchmark for summarization evaluation in English and Chinese, with domain-specific criteria and a novel debate-based annotation system.

Findings

01 Distinct performance patterns across domains and languages.

02 Large language models show bias in evaluating self-generated summaries.

03 Evaluation correlation varies with model and domain.
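
As a rough illustration of what finding 03 means in practice, the sketch below computes a per-domain Spearman rank correlation between an LLM evaluator's scores and reference (for example, human-annotated) scores. It is not the paper's evaluation code: the choice of Spearman correlation, the function names, the domain labels `domain_a` and `domain_b`, and all scores are assumptions made up for the example.

```python
from collections import defaultdict
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation, computed as Pearson correlation of the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return float("nan")  # undefined when one side is constant
    return cov / (var_x * var_y) ** 0.5


def correlation_by_domain(records):
    """Group (domain, reference_score, evaluator_score) rows and correlate."""
    grouped = defaultdict(lambda: ([], []))
    for domain, reference_score, evaluator_score in records:
        grouped[domain][0].append(reference_score)
        grouped[domain][1].append(evaluator_score)
    return {d: spearman(ref, ev) for d, (ref, ev) in grouped.items()}


# Hypothetical toy data: the evaluator agrees with the reference ordering
# in domain_a and reverses it in domain_b.
records = [
    ("domain_a", 0.9, 0.8), ("domain_a", 0.5, 0.6), ("domain_a", 0.2, 0.1),
    ("domain_b", 0.9, 0.2), ("domain_b", 0.5, 0.6), ("domain_b", 0.2, 0.9),
]
result = correlation_by_domain(records)
```

Running the same grouping once per evaluator model would give a model-by-domain table of correlations, which is one way the variation described in the finding could be inspected.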

Abstract

Evaluation frameworks for text summarization have evolved in terms of both domain coverage and metrics. However, existing benchmarks still lack domain-specific assessment criteria, remain predominantly English-centric, and face challenges with human annotation due to the complexity of reasoning. To address these, we introduce MSumBench, which provides a multi-dimensional, multi-domain evaluation of summarization in English and Chinese. It also incorporates specialized assessment criteria for each domain and leverages a multi-agent debate system to enhance annotation quality. By evaluating eight modern summarization models, we discover distinct performance patterns across domains and languages. We further examine large language models as summary evaluators, analyzing the correlation between their evaluation and summarization capabilities, and uncovering systematic bias in their…

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Semantic Web and Ontologies