Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events

Xiaoxing You; Qiang Huang; Lingyu Li; Xiaojun Chang; Jun Yu

arXiv:2603.06213 · cs.CV · March 9, 2026

Open Access

TL;DR

This paper introduces CoE, a training-free multimodal summarization framework that uses a Chain-of-Events approach guided by a Hierarchical Event Graph to improve cross-modal understanding and event reasoning without domain-specific supervision.

Contribution

The paper proposes a training-free multimodal summarization (MMS) method that uses a Hierarchical Event Graph for explicit event reasoning, addressing the limitations of implicit fusion and flat temporal modeling.

Findings

01 Outperforms state-of-the-art video CoT baselines across eight datasets.

02 Achieves significant improvements in ROUGE, CIDEr, and BERTScore metrics.

03 Demonstrates robustness, interpretability, and cross-domain generalization.

Abstract

Multimodal Summarization (MMS) aims to generate concise textual summaries by understanding and integrating information across videos, transcripts, and images. However, existing approaches still suffer from three main challenges: (1) reliance on domain-specific supervision, (2) implicit fusion with weak cross-modal grounding, and (3) flat temporal modeling without event transitions. To address these issues, we introduce **CoE**, a training-free MMS framework that performs structured reasoning through a **Chain-of-Events** guided by a Hierarchical Event Graph (HEG). The HEG encodes textual semantics into an explicit event hierarchy that scaffolds cross-modal grounding and temporal reasoning. Guided by this structure, **CoE** localizes key visual cues, models event evolution and causal transitions, and refines outputs via lightweight style adaptation for domain alignment. Extensive…
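
To make the idea of an explicit event hierarchy more concrete, here is a minimal Python sketch of how a hierarchy of events could be flattened into an ordered chain with pairwise transitions. It is only an illustration under stated assumptions: the names `EventNode`, `chain_of_events`, and `transitions`, the `(start, end)` time spans, and the example labels are all hypothetical and are not taken from the paper, whose actual HEG construction and prompting are not described in this listing.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class EventNode:
    """One node in a hypothetical event hierarchy (not the paper's schema)."""
    label: str
    # Assumed (start, end) time span in seconds; None if the node is not grounded.
    span: Optional[Tuple[float, float]] = None
    children: List["EventNode"] = field(default_factory=list)


def chain_of_events(root: EventNode) -> List[EventNode]:
    """Collect leaf events and order them by start time.

    Leaves without a time span are placed at the end.
    """
    leaves: List[EventNode] = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:
            stack.extend(node.children)
        else:
            leaves.append(node)
    return sorted(leaves, key=lambda n: n.span[0] if n.span else float("inf"))


def transitions(chain: List[EventNode]) -> List[Tuple[str, str]]:
    """Pair each event with its successor to expose event-to-event transitions."""
    return [(a.label, b.label) for a, b in zip(chain, chain[1:])]


# Hypothetical example: a two-level hierarchy for a cooking video.
root = EventNode("cooking video", children=[
    EventNode("preparation", children=[
        EventNode("chop vegetables", span=(0.0, 30.0)),
        EventNode("boil water", span=(30.0, 55.0)),
    ]),
    EventNode("serving", children=[
        EventNode("plate the dish", span=(55.0, 80.0)),
    ]),
])

for src, dst in transitions(chain_of_events(root)):
    print(f"{src} -> {dst}")
```

In a training-free setting, an ordered chain like this could be serialized into a prompt for a pretrained model, though how CoE actually does this is not specified here.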

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Video Analysis and Summarization