Measuring What Matters Beyond Text: Evaluating Multimodal Summaries by Quality, Alignment, and Diversity

Abid Ali; Diego Molla-Aliod; Usman Naseem

arXiv:2605.11693·cs.AI·May 13, 2026

TL;DR

This paper introduces MM-Eval, a comprehensive framework for evaluating multimodal summaries by jointly assessing text quality, image-text alignment, and visual diversity, addressing the fragmentation in current evaluation methods.

Contribution

The authors propose MM-Eval, a unified, interpretable evaluation framework that combines text-quality, image-text alignment, and visual-diversity metrics and calibrates them against human preferences.

Findings

01. Factual consistency is the key determinant of overall quality.

02. MM-Eval outperforms heuristic aggregation baselines.

03. Visual relevance and diversity are important but secondary to text quality.

Abstract

Multimodal Large Language Models (MLLMs) have facilitated Multimodal Summarization with Multimodal Output (MSMO), wherein systems generate concise textual summaries accompanied by salient visuals from multimodal sources. However, current MSMO evaluation remains fragmented: text quality, image-text alignment, and visual diversity are typically assessed in isolation using unimodal metrics, making it difficult to capture whether the modalities jointly support a faithful and useful summary. To address this gap, we introduce MM-Eval, a unified evaluation framework that integrates assessments of textual quality, cross-modal alignment, and visual diversity. MM-Eval comprises three components: (1) text quality, measured using OpenFActScore for factual consistency and G-Eval for coherence, fluency, and relevance; (2) image-text relevance, evaluated via an MLLM-as-a-judge approach; and (3)…
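
To make the idea of calibrating component scores against human preferences concrete, here is a minimal sketch. It is not the authors' method: the excerpt above does not say how MM-Eval combines or calibrates its components, so the linear weighting, the gradient-descent fit, the names `ComponentScores`, `mean_aggregate`, `weighted_aggregate`, and `fit_weights`, and all numbers are assumptions made for illustration. The component scores and human ratings are synthetic toy values, not results from the paper.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class ComponentScores:
    """Hypothetical per-summary scores, each assumed to lie in [0, 1]."""
    text_quality: float
    image_text_alignment: float
    visual_diversity: float

    def as_vector(self) -> List[float]:
        return [self.text_quality, self.image_text_alignment, self.visual_diversity]


def mean_aggregate(s: ComponentScores) -> float:
    """Heuristic baseline: unweighted mean of the three components."""
    v = s.as_vector()
    return sum(v) / len(v)


def weighted_aggregate(s: ComponentScores, weights: Sequence[float]) -> float:
    """Linear combination of the components (an assumed functional form)."""
    return sum(w * x for w, x in zip(weights, s.as_vector()))


def fit_weights(samples: Sequence[ComponentScores], human: Sequence[float],
                lr: float = 0.1, steps: int = 5000) -> List[float]:
    """Fit weights by gradient descent on mean squared error against human ratings.

    Starts from equal weights, i.e. from the unweighted-mean baseline.
    """
    w = [1.0 / 3.0] * 3
    n = len(samples)
    for _ in range(steps):
        grad = [0.0, 0.0, 0.0]
        for s, y in zip(samples, human):
            x = s.as_vector()
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j in range(3):
                grad[j] += 2.0 * err * x[j] / n
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


def mse(preds: Sequence[float], targets: Sequence[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(targets)


# Synthetic toy data (NOT from the paper): component scores and human ratings.
samples = [
    ComponentScores(0.9, 0.5, 0.2),
    ComponentScores(0.4, 0.8, 0.9),
    ComponentScores(0.7, 0.3, 0.6),
    ComponentScores(0.2, 0.9, 0.4),
    ComponentScores(0.6, 0.6, 0.1),
]
human_ratings = [0.71, 0.57, 0.57, 0.43, 0.55]

weights = fit_weights(samples, human_ratings)
baseline_error = mse([mean_aggregate(s) for s in samples], human_ratings)
calibrated_error = mse([weighted_aggregate(s, weights) for s in samples], human_ratings)
print("fitted weights:", [round(w, 3) for w in weights])
print("baseline MSE:", baseline_error, "calibrated MSE:", calibrated_error)
```

Because the fit starts from equal weights, the calibrated combination can only match or improve on the unweighted mean on the data it was fit to. A real comparison would need held-out human judgments, and the paper's own calibration procedure may differ from this linear form.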
