Multimodal Fusion and Coherence Modeling for Video Topic Segmentation

Hai Yu; Chong Deng; Qinglin Zhang; Jiaqing Liu; Qian Chen; Wen Wang

arXiv:2408.00365 · cs.AI · December 31, 2024

Open Access

TL;DR

This paper advances video topic segmentation by developing multimodal fusion and coherence modeling techniques, including new pre-training and fine-tuning tasks, evaluated on English and Chinese educational videos.

Contribution

The paper introduces multimodal fusion architectures, a multimodal contrastive learning pre-training task, and new coherence modeling tasks tailored to VTS, and evaluates them extensively.

Findings

01 Superior performance on English lecture videos

02 Effective multimodal fusion with cross-attention and mixture of experts

03 Enhanced results on Chinese lecture dataset

Abstract

The video topic segmentation (VTS) task segments videos into intelligible, non-overlapping topics, facilitating efficient comprehension of video content and quick access to specific content. VTS is also critical to various downstream video understanding tasks. Traditional VTS methods using shallow features or unsupervised approaches struggle to accurately discern the nuances of topical transitions. Recently, supervised approaches have achieved superior performance on video action or scene segmentation over unsupervised approaches. In this work, we improve supervised VTS by thoroughly exploring multimodal fusion and multimodal coherence modeling. Specifically, (1) we enhance multimodal fusion by exploring different architectures using cross-attention and mixture of experts. (2) To generally strengthen multimodality alignment and fusion, we pre-train and fine-tune the model with…
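The abstract names cross-attention and mixture of experts as building blocks for multimodal fusion. As a rough illustration only, the sketch below shows one way those two pieces can be combined: text features attend over visual features, and a softmax gate mixes several linear experts over the joined representation. It is not the paper's architecture. The function names (`cross_attention`, `moe_fuse`), the dimensions, and the random, untrained weights are all assumptions made for this example.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attention(queries, context):
    """Each query vector attends over all context vectors.

    Single head, no learned projections (a simplification for illustration).
    Queries and context vectors must share the same dimension.
    """
    scale = math.sqrt(len(queries[0]))
    out = []
    for q in queries:
        weights = softmax([dot(q, c) / scale for c in context])
        out.append([sum(w * c[j] for w, c in zip(weights, context))
                    for j in range(len(q))])
    return out


def linear(x, w):
    """Multiply vector x (length in_dim) by matrix w (in_dim rows, out_dim columns)."""
    return [sum(x[i] * w[i][j] for i in range(len(x))) for j in range(len(w[0]))]


def moe_fuse(text, visual, gate_w, expert_ws):
    """Hypothetical fusion: gate-weighted sum of linear experts over [text; attended visual]."""
    attended = cross_attention(text, visual)
    fused = []
    for t, a in zip(text, attended):
        joint = t + a                              # concatenate the two modalities
        gates = softmax(linear(joint, gate_w))     # one weight per expert
        outs = [linear(joint, w) for w in expert_ws]
        fused.append([sum(g * o[j] for g, o in zip(gates, outs))
                      for j in range(len(outs[0]))])
    return fused


# Toy inputs: 4 sentence embeddings and 6 frame embeddings, all of dimension 8.
rng = random.Random(0)
n_sent, n_frames, d, n_experts = 4, 6, 8, 3


def rand_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


text = rand_matrix(n_sent, d)
visual = rand_matrix(n_frames, d)
gate_w = rand_matrix(2 * d, n_experts)
expert_ws = [rand_matrix(2 * d, d) for _ in range(n_experts)]

fused = moe_fuse(text, visual, gate_w, expert_ws)
print(len(fused), len(fused[0]))  # 4 8
```

In a trained model the gate and expert weights would be learned, and the fused sentence-level vectors would feed a classifier that predicts topic boundaries. That downstream step is omitted here.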

Taxonomy

Topics: Video Analysis and Summarization · Video Surveillance and Tracking Methods