Accommodating Audio Modality in CLIP for Multimodal Processing

Ludan Ruan; Anwen Hu; Yuqing Song; Liang Zhang; Sipeng Zheng; Qin Jin

arXiv:2303.06591 · cs.CV · March 14, 2023 · 1 citation

TL;DR

This paper extends the CLIP model to include the audio modality, enabling joint multimodal processing of vision, language, and audio, and achieves state-of-the-art results on multiple video and audio benchmarks.

Contribution

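The paper introduces CLIP4VLA, an extension of CLIP that incorporates the audio modality through inter-modal and intra-modal contrastive learning, together with an audio type token that lets the model learn which type of audio information, verbal or nonverbal, matters in a given scenario.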

Findings

1. Achieves state-of-the-art performance on the MSR-VTT, VATEX, and AudioCaps datasets.

2. Models correlations among the audio, vision, and language modalities.

3. Improves performance on video retrieval and video captioning tasks.

Abstract

Multimodal processing has attracted much attention lately, especially with the success of pre-training. However, exploration has mainly focused on vision-language pre-training, as introducing more modalities can greatly complicate model design and optimization. In this paper, we extend the state-of-the-art Vision-Language model CLIP to accommodate the audio modality for Vision-Language-Audio multimodal processing. Specifically, we apply inter-modal and intra-modal contrastive learning to explore the correlation between audio and other modalities in addition to the inner characteristics of the audio modality. Moreover, we design an audio type token to dynamically learn different types of audio information for different scenarios, as general audio conveys both verbal and nonverbal heterogeneous information. Our proposed CLIP4VLA model is validated in different downstream…
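The abstract's inter-modal contrastive learning can be illustrated with a generic symmetric InfoNCE loss of the kind used in CLIP-style training. This is a minimal sketch, not the CLIP4VLA implementation: the function names, the toy 2-D embeddings, and the temperature of 0.07 are all assumptions made for illustration, and the paper's exact loss formulation is not given in this excerpt.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_info_nce(a_embs, b_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    a_embs[i] and b_embs[i] are a positive pair (for example an audio clip
    and its caption); every other combination in the batch is a negative.
    The temperature default is an illustrative assumption.
    """
    a = [l2_normalize(v) for v in a_embs]
    b = [l2_normalize(v) for v in b_embs]
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [[sum(x * y for x, y in zip(u, w)) / temperature for w in b] for u in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the a-to-b and b-to-a directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))

# Toy batch: hypothetical audio embeddings and two candidate text batches.
audio = [[1.0, 0.0], [0.0, 1.0]]
text_aligned = [[0.9, 0.1], [0.1, 0.9]]   # row i matches audio row i
text_swapped = [[0.1, 0.9], [0.9, 0.1]]   # pairs are mismatched

loss_aligned = symmetric_info_nce(audio, text_aligned)
loss_swapped = symmetric_info_nce(audio, text_swapped)
print(loss_aligned < loss_swapped)
```

Correctly paired embeddings yield a lower loss than mismatched ones, which is the signal that pulls matching audio and text (or audio and video) together. The same function could in principle be applied to two views of the same audio clip for an intra-modal objective, though how the paper constructs such pairs is not stated here.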

Code & Models

Repositories

ludanruan/clip4vla (PyTorch, official)

Taxonomy

Topics: Music and Audio Processing · Subtitles and Audiovisual Media · Multimodal Machine Learning Applications

Methods: Contrastive Learning · Contrastive Language-Image Pre-training