Cut to the Chase: Training-free Multimodal Summarization via Chain-of-Events
Xiaoxing You, Qiang Huang, Lingyu Li, Xiaojun Chang, Jun Yu

TL;DR
This paper introduces CoE, a training-free multimodal summarization framework that uses a Chain-of-Events approach guided by a Hierarchical Event Graph to improve cross-modal understanding and event reasoning without domain-specific supervision.
Contribution
The paper proposes a novel training-free MMS method leveraging a Hierarchical Event Graph for explicit event reasoning, addressing limitations of implicit fusion and flat temporal modeling.
Findings
Outperforms state-of-the-art video CoT baselines across eight datasets.
Achieves significant improvements in ROUGE, CIDEr, and BERTScore metrics.
Demonstrates robustness, interpretability, and cross-domain generalization.
Abstract
Multimodal Summarization (MMS) aims to generate concise textual summaries by understanding and integrating information across videos, transcripts, and images. However, existing approaches still suffer from three main challenges: (1) reliance on domain-specific supervision, (2) implicit fusion with weak cross-modal grounding, and (3) flat temporal modeling without event transitions. To address these issues, we introduce **CoE**, a training-free MMS framework that performs structured reasoning through a **Chain-of-Events** guided by a Hierarchical Event Graph (HEG). The HEG encodes textual semantics into an explicit event hierarchy that scaffolds cross-modal grounding and temporal reasoning. Guided by this structure, **CoE** localizes key visual cues, models event evolution and causal transitions, and refines outputs via lightweight style adaptation for domain alignment. Extensive…
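To make the idea of an explicit event hierarchy concrete, here is a minimal sketch of how nested events could be stored and flattened into a time-ordered chain. It is an illustration only, not the authors' implementation: the `EventNode` class, its `start` timestamp field, the `chain_of_events` helper, and the cooking example are all hypothetical, and the abstract does not specify how the HEG is actually built or traversed.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EventNode:
    """One node in a toy event hierarchy (hypothetical, not the paper's HEG)."""
    name: str
    start: float  # assumed timestamp in seconds, used only for ordering
    children: List["EventNode"] = field(default_factory=list)


def chain_of_events(root: EventNode) -> List[str]:
    """Collect the leaf events under `root` and return their names in temporal order."""
    leaves = []
    stack = [root]
    while stack:
        node = stack.pop()
        if node.children:
            # Internal node: descend into its sub-events.
            stack.extend(node.children)
        else:
            # Leaf node: an atomic event that becomes one link in the chain.
            leaves.append(node)
    return [leaf.name for leaf in sorted(leaves, key=lambda n: n.start)]


# Toy hierarchy: a top-level event with two phases and three atomic events.
root = EventNode("cooking video", 0.0, [
    EventNode("prepare", 0.0, [
        EventNode("chop onions", 5.0),
        EventNode("heat pan", 20.0),
    ]),
    EventNode("cook", 30.0, [
        EventNode("fry onions", 35.0),
    ]),
])

chain = chain_of_events(root)
print(chain)
```

In this sketch the hierarchy only groups events and the chain is purely temporal; the causal transitions and visual grounding described in the abstract would need additional edges and a vision model, which are outside what this example covers.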
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Video Analysis and Summarization
