LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment

Bin Zhu; Bin Lin; Munan Ning; Yang Yan; Jiaxi Cui; HongFa Wang; Yatian Pang; Wenhao Jiang; Junwu Zhang; Zongwei Li; Wancai Zhang; Zhifeng Li; Wei Liu; Li Yuan

arXiv:2310.01852 · cs.CV · January 23, 2024 · 25 cites

Open Access · 5 Repos · 10 Models · 3 Datasets

TL;DR

LanguageBind introduces a novel framework that extends video-language pretraining to N modalities by using language as a semantic anchor, enabling multi-modal alignment and improving performance across diverse benchmarks.

Contribution

The paper proposes a method for extending video-language pretraining to multiple modalities by using language as a semantic bridge, and introduces VIDAL-10M, a large dataset of aligned data pairs centered on language.

Findings

1. Achieved superior performance on 15 diverse benchmarks.

2. Effectively aligned multiple modalities to language in a shared feature space.

3. Demonstrated the effectiveness of indirect alignment and modality complementarity.

Abstract

The video-language (VL) pretraining has achieved remarkable improvement in multiple downstream tasks. However, the current VL pretraining framework is hard to extend to multiple modalities (N modalities, N>=3) beyond vision and language. We thus propose LanguageBind, taking the language as the bind across different modalities because the language modality is well-explored and contains rich semantics. Specifically, we freeze the language encoder acquired by VL pretraining, then train encoders for other modalities with contrastive learning. As a result, all modalities are mapped to a shared feature space, implementing multi-modal semantic alignment. While LanguageBind ensures that we can extend VL modalities to N modalities, we also need a high-quality dataset with alignment data pairs centered on language. We thus propose VIDAL-10M with Video, Infrared, Depth, Audio and their…
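
The training recipe described in the abstract (a frozen language encoder, with encoders for the other modalities trained by contrastive learning) can be illustrated with a minimal sketch. The abstract only says "contrastive learning"; the symmetric InfoNCE form, the temperature of 0.07, the function names, and the toy embeddings below are all assumptions for illustration, not the authors' implementation. The text embeddings stand in for the output of the frozen language encoder and are never updated; only the other modality's embeddings would change during training.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products are cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def log_softmax(row):
    # Numerically stable log-softmax over one row of logits.
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]

def contrastive_loss(modality_emb, text_emb, temperature=0.07):
    # Symmetric InfoNCE loss (assumed form): the i-th modality sample and
    # the i-th text sample are a positive pair; all other pairs in the
    # batch are negatives. temperature=0.07 is an assumed value.
    m = [l2_normalize(v) for v in modality_emb]
    t = [l2_normalize(v) for v in text_emb]
    n = len(m)
    logits = [
        [sum(a * b for a, b in zip(m[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Modality-to-text direction: each row should pick its own column.
    m2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-modality direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    t2m = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (m2t + t2m)

# Toy stand-in for frozen language-encoder outputs (hypothetical values).
text_emb = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Toy outputs of a trainable modality encoder (e.g. depth or audio).
aligned = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
# The same embeddings paired with the wrong captions.
misaligned = [aligned[1], aligned[2], aligned[0]]

loss_aligned = contrastive_loss(aligned, text_emb)
loss_misaligned = contrastive_loss(misaligned, text_emb)
print(loss_aligned < loss_misaligned)
```

Because every modality is pulled toward the same fixed text embeddings, two non-language modalities end up in one shared space without ever being paired with each other directly.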

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Cancer-related molecular mechanisms research