MF2Summ: Multimodal Fusion for Video Summarization with Temporal Alignment

Shuo Wang; Jihao Zhang

arXiv:2506.10430 · cs.CV · June 13, 2025

TL;DR

MF2Summ is a multimodal video summarization model that fuses visual and auditory features through a cross-modal Transformer and an alignment-guided self-attention Transformer, and it reports higher F1-scores on SumMe and TVSum than existing methods.

Contribution

This paper introduces MF2Summ, a novel multimodal fusion approach with temporal alignment for video summarization, utilizing cross-modal Transformers and alignment-guided self-attention.

Findings

01. Achieves higher F1-scores on SumMe and TVSum datasets compared to state-of-the-art methods.

02. Effectively models inter-modal dependencies and temporal correspondences.

03. Demonstrates the benefit of multimodal fusion in capturing video semantics.
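
The F1-score cited in the findings is the harmonic mean of precision and recall. This page does not describe the evaluation protocol, so the following is only a minimal sketch that assumes the predicted summary and the reference summary are both given as per-frame binary selections. The function name `f1_score` and the sample data are hypothetical.

```python
def f1_score(pred, truth):
    """F1 between two equal-length binary selections (1 = frame kept in the summary).

    Assumption: frame-level overlap; the paper's exact protocol is not stated on this page.
    """
    if len(pred) != len(truth):
        raise ValueError("pred and truth must have the same length")
    # Frames selected by both the prediction and the reference.
    overlap = sum(1 for p, t in zip(pred, truth) if p and t)
    if overlap == 0:
        return 0.0
    precision = overlap / sum(pred)   # share of predicted frames that are correct
    recall = overlap / sum(truth)     # share of reference frames that were recovered
    return 2 * precision * recall / (precision + recall)


# Hypothetical six-frame video.
predicted = [1, 1, 0, 0, 1, 0]
reference = [1, 0, 0, 1, 1, 0]
score = f1_score(predicted, reference)
```

Here two of the three predicted frames match the reference, so precision and recall are both 2/3.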

Abstract

The rapid proliferation of online video content necessitates effective video summarization techniques. Traditional methods, often relying on a single modality (typically visual), struggle to capture the full semantic richness of videos. This paper introduces MF2Summ, a novel video summarization model based on multimodal content understanding, integrating both visual and auditory information. MF2Summ employs a five-stage process: feature extraction, cross-modal attention interaction, feature fusion, segment prediction, and key shot selection. Visual features are extracted using a pre-trained GoogLeNet model, while auditory features are derived using SoundNet. The core of our fusion mechanism involves a cross-modal Transformer and an alignment-guided self-attention Transformer, designed to effectively model inter-modal dependencies and temporal correspondences. Segment importance,…
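
To illustrate the cross-modal attention idea named in the abstract, here is a minimal sketch of scaled dot-product attention in which features from one modality act as queries over features from the other. It is not the authors' implementation: it omits learned projections, multiple heads, and the alignment-guided masking, and the function names and toy feature vectors are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(queries, keys, values):
    """Each query vector (e.g. a visual frame feature) attends over the
    keys/values of the other modality (e.g. audio segment features).

    Assumption: plain scaled dot-product attention with no learned weights.
    """
    scale = math.sqrt(len(keys[0]))
    attended = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        attended.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return attended


# Toy 2-dimensional features: three visual frames, two audio segments.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [1.0, 0.0]]
fused = cross_modal_attention(visual, audio, audio)
```

Each output row is a convex combination of the audio features, weighted by how strongly the corresponding visual frame matches each audio segment. A full model would typically apply the same operation in the opposite direction as well and combine the results.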

Taxonomy

Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications

Methods: Convolution · Auxiliary Classifier · Absolute Position Encodings · Layer Normalization · 1x1 Convolution · Local Response Normalization · Inception Module · Max Pooling · Byte Pair Encoding