MEGA: Multimodal Alignment Aggregation and Distillation For Cinematic Video Segmentation
Najmeh Sadoughi, Xinyu Li, Avijit Vajpayee, David Fan, Bing Shuai, Hector Santos-Villalobos, Vimal Bhat, Rohith MV

TL;DR
This paper introduces MEGA, a multimodal approach for long-form cinematic video segmentation that aligns, fuses, and distills multiple media modalities to improve scene and act segmentation accuracy.
Contribution
MEGA presents a novel multimodal alignment, fusion, and distillation framework specifically designed for long cinematic videos, enhancing segmentation performance.
Findings
Outperforms the state of the art on MovieNet scene segmentation by +1.19% AP
Outperforms the state of the art on TRIPOD act segmentation by +5.51% agreement
Synchronizes modalities and transfers labels across them effectively
Abstract
Previous research has studied the task of segmenting cinematic videos into scenes and into narrative acts. However, these studies have overlooked the essential task of multimodal alignment and fusion for effectively and efficiently processing long-form videos (>60min). In this paper, we introduce Multimodal alignmEnt aGgregation and distillAtion (MEGA) for cinematic long-video segmentation. MEGA tackles the challenge by leveraging multiple media modalities. The method coarsely aligns inputs of variable lengths and different modalities with alignment positional encoding. To maintain temporal synchronization while reducing computation, we further introduce an enhanced bottleneck fusion layer which uses temporal alignment. Additionally, MEGA employs a novel contrastive loss to synchronize and transfer labels across modalities, enabling act segmentation from labeled synopsis sentences on…
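The coarse alignment step described in the abstract can be illustrated with a minimal sketch. It assumes that each modality's sequence, whatever its length, is mapped onto a shared timeline of `num_bins` positions, and that elements falling in the same bin receive the same positional vector. The function names, the binning rule, and the sinusoidal table are hypothetical choices made for illustration; the abstract does not specify how MEGA computes its alignment positional encoding.

```python
import numpy as np


def coarse_alignment_positions(seq_len: int, num_bins: int) -> np.ndarray:
    """Map each of seq_len steps to one of num_bins shared timeline bins.

    Hypothetical helper: the binning rule is an assumption, not the paper's.
    """
    # Normalized midpoint of each step on the [0, 1) timeline.
    centers = (np.arange(seq_len) + 0.5) / seq_len
    # Scale to the shared timeline and clip to the last valid bin.
    return np.minimum((centers * num_bins).astype(int), num_bins - 1)


def sinusoidal_table(num_bins: int, dim: int) -> np.ndarray:
    """Standard sinusoidal positional table (dim must be even)."""
    positions = np.arange(num_bins)[:, None]
    freqs = np.exp(-np.log(10000.0) * (np.arange(0, dim, 2) / dim))
    table = np.zeros((num_bins, dim))
    table[:, 0::2] = np.sin(positions * freqs)
    table[:, 1::2] = np.cos(positions * freqs)
    return table


# Three modalities of different lengths share one 4-bin timeline.
num_bins, dim = 4, 8
video_pos = coarse_alignment_positions(8, num_bins)  # e.g. 8 shots
audio_pos = coarse_alignment_positions(4, num_bins)  # e.g. 4 audio chunks
text_pos = coarse_alignment_positions(2, num_bins)   # e.g. 2 sentences

table = sinusoidal_table(num_bins, dim)
# Elements in the same bin get the same positional vector, whatever
# their modality or original sequence length.
video_encoding = table[video_pos]  # shape (8, dim)
audio_encoding = table[audio_pos]  # shape (4, dim)
text_encoding = table[text_pos]    # shape (2, dim)
```

In this sketch, the first two video steps and the first audio step land in bin 0 and therefore share a positional vector, which is one simple way to give a fusion layer a common temporal reference across modalities.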
Taxonomy
Topics: Human Pose and Action Recognition · Video Analysis and Summarization · Advanced Image Processing Techniques
