Measuring What Matters Beyond Text: Evaluating Multimodal Summaries by Quality, Alignment, and Diversity
Abid Ali, Diego Molla-Aliod, Usman Naseem

TL;DR
This paper introduces MM-Eval, a comprehensive framework for evaluating multimodal summaries by jointly assessing text quality, image-text alignment, and visual diversity, addressing the fragmentation in current evaluation methods.
Contribution
The authors propose MM-Eval, a unified, interpretable evaluation framework that combines text-quality, image-text alignment, and visual-diversity metrics and calibrates them against human preferences.
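To make the idea of calibrating combined metrics against human preferences concrete, here is a minimal sketch. It assumes a plain least-squares linear fit, which the summary above does not confirm as the paper's method. The score matrix, the human ratings, and the function names (`heuristic_mean`, `fit_weights`, `predict`) are all hypothetical and exist only for illustration.

```python
import numpy as np

# Hypothetical per-summary component scores in [0, 1]. Columns:
# factual consistency, coherence/fluency/relevance, image-text relevance,
# visual diversity. Made-up numbers, not results from the paper.
scores = np.array([
    [0.9, 0.8, 0.7, 0.6],
    [0.4, 0.7, 0.8, 0.9],
    [0.7, 0.6, 0.5, 0.4],
    [0.2, 0.5, 0.6, 0.7],
    [0.8, 0.9, 0.4, 0.5],
    [0.5, 0.4, 0.9, 0.3],
])
# Hypothetical human overall-quality ratings for the same six summaries.
human = np.array([0.85, 0.50, 0.65, 0.30, 0.80, 0.50])


def heuristic_mean(scores: np.ndarray) -> np.ndarray:
    """Heuristic aggregation baseline: unweighted mean of the components."""
    return scores.mean(axis=1)


def fit_weights(scores: np.ndarray, human: np.ndarray) -> np.ndarray:
    """Fit one weight per component plus an intercept by least squares.

    Assumption: a linear model stands in for whatever calibration the
    paper actually uses.
    """
    design = np.hstack([scores, np.ones((scores.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(design, human, rcond=None)
    return weights


def predict(scores: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """Apply fitted weights (last entry is the intercept) to component scores."""
    design = np.hstack([scores, np.ones((scores.shape[0], 1))])
    return design @ weights


def mse(pred: np.ndarray, target: np.ndarray) -> float:
    """Mean squared error between predictions and targets."""
    return float(np.mean((pred - target) ** 2))


weights = fit_weights(scores, human)
baseline_error = mse(heuristic_mean(scores), human)
calibrated_error = mse(predict(scores, weights), human)
```

On the data it was fitted to, the calibrated score can never have a higher squared error than the unweighted mean, because the mean is one particular choice of weights within the same linear family. A real comparison would need held-out human judgments, and the fitted weights can be read as a rough indication of how much each component matters to annotators.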
Findings
Factual consistency is the key determinant of overall quality.
MM-Eval outperforms heuristic aggregation baselines.
Visual relevance and diversity are important but secondary to text quality.
Abstract
Multimodal Large Language Models (MLLMs) have facilitated Multimodal Summarization with Multimodal Output (MSMO), wherein systems generate concise textual summaries accompanied by salient visuals from multimodal sources. However, current MSMO evaluation remains fragmented: text quality, image-text alignment, and visual diversity are typically assessed in isolation using unimodal metrics, making it difficult to capture whether the modalities jointly support a faithful and useful summary. To address this gap, we introduce MM-Eval, a unified evaluation framework that integrates assessments of textual quality, cross-modal alignment, and visual diversity. MM-Eval comprises three components: (1) text quality, measured using OpenFActScore for factual consistency and G-Eval for coherence, fluency, and relevance; (2) image-text relevance, evaluated via an MLLM-as-a-judge approach; and (3)…
