Unified Video-Language Pre-training with Synchronized Audio

Shentong Mo; Haofan Wang; Huaxia Li; Xu Tang

arXiv:2405.07202·cs.CV·May 14, 2024

Open Access

TL;DR

This paper introduces VLSA, a unified self-supervised transformer framework that learns tri-modal video, audio, and text representations by explicitly modeling their natural synchronization, which improves retrieval performance.

Contribution

It proposes VLSA, a novel tri-modal pre-training approach that explicitly incorporates audio-visual-textual synchronization and local-patch masked modeling for enhanced multimodal understanding.

Findings

01. VLSA outperforms state-of-the-art methods on retrieval tasks.

02. Pre-training on only 0.9M data yields significant improvements.

03. Qualitative visualizations demonstrate better discriminative representations.

Abstract

Video-language pre-training is a typical and challenging problem that aims to learn visual and textual representations from large-scale data in a self-supervised way. Existing pre-training approaches either capture the correspondence of image-text pairs or exploit the temporal ordering of frames. However, they do not explicitly explore the natural synchronization between audio and the other two modalities. In this work, we propose an enhanced framework for Video-Language pre-training with Synchronized Audio, termed VLSA, that learns tri-modal representations in a unified self-supervised transformer. Specifically, VLSA jointly aggregates embeddings of local patches and global tokens for video, text, and audio. Furthermore, we use local-patch masked modeling to learn modality-aware features, and leverage global audio matching to capture audio-guided features for video and…
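
The two objectives named in the abstract can be illustrated with a minimal sketch. The abstract does not give the loss formulas, so everything here is an assumption rather than the authors' method: masking is shown as replacing a random subset of local patch embeddings with a constant vector, and global audio matching is shown as a one-directional InfoNCE-style contrastive loss over cosine similarities. The function names (mask_patches, audio_matching_loss), the temperature value, and the mask value are all hypothetical.

```python
import math
import random


def mask_patches(patches, mask_ratio, rng, mask_value=0.0):
    """Replace a random subset of local patch embeddings with a mask vector.

    Assumption: a constant mask vector stands in for a learned mask token.
    Returns the masked sequence and the sorted indices that were masked.
    """
    n = len(patches)
    n_mask = int(round(n * mask_ratio))
    masked_idx = sorted(rng.sample(range(n), n_mask))
    masked_set = set(masked_idx)
    masked = [
        [mask_value] * len(p) if i in masked_set else list(p)
        for i, p in enumerate(patches)
    ]
    return masked, masked_idx


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def audio_matching_loss(audio_globals, other_globals, temperature=0.07):
    """InfoNCE-style loss (an assumed stand-in for global audio matching).

    The global audio token of clip i should score highest against the
    global video (or text) token of the same clip i within the batch.
    """
    n = len(audio_globals)
    total = 0.0
    for i in range(n):
        logits = [
            cosine(audio_globals[i], other_globals[j]) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over the batch.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]
    return total / n


if __name__ == "__main__":
    rng = random.Random(0)
    patches = [[float(i), 1.0] for i in range(8)]
    masked, idx = mask_patches(patches, 0.5, rng)
    print("masked patch indices:", idx)

    audio = [[1.0, 0.0], [0.0, 1.0]]
    video = [[1.0, 0.0], [0.0, 1.0]]
    print("matching loss (aligned pairs):", audio_matching_loss(audio, video))
```

In this sketch a batch whose audio and video tokens are aligned clip by clip produces a lower loss than the same batch with the pairs swapped, which is the behaviour a synchronization objective is meant to reward.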

Taxonomy

Topics: Subtitles and Audiovisual Media · Video Analysis and Summarization · Multimedia Communication and Technology