Unified Video-Language Pre-training with Synchronized Audio
Shentong Mo, Haofan Wang, Huaxia Li, Xu Tang

TL;DR
This paper introduces VLSA, a unified self-supervised transformer framework that effectively learns tri-modal video, audio, and text representations by explicitly modeling their natural synchronization, leading to improved retrieval performance.
Contribution
It proposes VLSA, a novel tri-modal pre-training approach that explicitly incorporates audio-visual-textual synchronization and local-patch masked modeling for enhanced multimodal understanding.
Findings
VLSA outperforms state-of-the-art methods on retrieval tasks.
Pre-training on only 0.9M data yields significant improvements.
Qualitative visualizations demonstrate better discriminative representations.
Abstract
Video-language pre-training is a typical and challenging problem that aims at learning visual and textual representations from large-scale data in a self-supervised way. Existing pre-training approaches either capture the correspondence of image-text pairs or utilize the temporal ordering of frames. However, they do not explicitly explore the natural synchronization between audio and the other two modalities. In this work, we propose an enhanced framework for Video-Language pre-training with Synchronized Audio, termed VLSA, that can learn tri-modal representations in a unified self-supervised transformer. Specifically, VLSA jointly aggregates embeddings of local patches and global tokens for video, text, and audio. Furthermore, we utilize local-patch masked modeling to learn modality-aware features, and leverage global audio matching to capture audio-guided features for video and…
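The two objectives named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes that global audio matching can be approximated by a symmetric contrastive (InfoNCE-style) loss over paired global embeddings, and that local-patch masking picks a random subset of patch indices. The function names, the temperature of 0.07, and the mask ratio of 0.15 are all hypothetical choices made for illustration.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(v - m) for v in row))
        total += log_sum - row[i]
    return total / len(logits)


def global_matching_loss(audio, video, temperature=0.07):
    """Symmetric contrastive loss over paired global embeddings.

    Assumption: audio[i] and video[i] come from the same clip, and all
    other pairs in the batch act as negatives. The temperature value is
    a hypothetical default, not taken from the paper.
    """
    a = [l2_normalize(v) for v in audio]
    b = [l2_normalize(v) for v in video]
    # Cosine similarity between every audio and every video embedding.
    logits = [
        [sum(x * y for x, y in zip(ai, bj)) / temperature for bj in b]
        for ai in a
    ]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


def mask_patches(num_patches, mask_ratio=0.15, rng=None):
    """Pick patch indices to mask; the ratio is an illustrative assumption."""
    rng = rng or random.Random(0)
    k = max(1, int(num_patches * mask_ratio))
    return sorted(rng.sample(range(num_patches), k))
```

In this sketch, a batch whose audio and video embeddings are correctly paired produces a lower loss than the same batch with the pairing shuffled, which is the behavior a synchronization-aware objective is meant to reward.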
Taxonomy
Topics: Subtitles and Audiovisual Media · Video Analysis and Summarization · Multimedia Communication and Technology
