Integrating Video and Text: A Balanced Approach to Multimodal Summary Generation and Evaluation

Galann Pennec; Zhengyuan Liu; Nicholas Asher; Philippe Muller; Nancy F. Chen

arXiv:2505.06594 · cs.CL · November 3, 2025

Open Access

TL;DR

This paper presents a zero-shot multimodal summarization method that builds screenplay summaries integrating video, dialogue, and character information, and introduces a new metric for evaluating multimodal summaries. The method is reported to outperform existing models.

Contribution

The paper introduces a novel zero-shot video-to-text summarization approach that generates screenplay summaries, together with a multimodal evaluation metric, to address limitations of existing methods.

Findings

01. Generated summaries contain 20% more relevant visual information.

02. Requires 75% less video input than state-of-the-art models.

03. Outperforms Gemini 1.5 on the SummScreen3D dataset.

Abstract

Vision-Language Models (VLMs) often struggle to balance visual and textual information when summarizing complex multimodal inputs, such as entire TV show episodes. In this paper, we propose a zero-shot video-to-text summarization approach that builds its own screenplay representation of an episode, effectively integrating key video moments, dialogue, and character information into a unified document. Unlike previous approaches, we simultaneously generate screenplays and name the characters in zero-shot, using only the audio, video, and transcripts as input. Additionally, we highlight that existing summarization metrics can fail to assess the multimodal content in summaries. To address this, we introduce MFactSum, a multimodal metric that evaluates summaries with respect to both vision and text modalities. Using MFactSum, we evaluate our screenplay summaries on the SummScreen3D dataset,…

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Topic Modeling