LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
Bin Zhu, Bin Lin, Munan Ning, Yang Yan, Jiaxi Cui, HongFa Wang, Yatian Pang, Wenhao Jiang, Junwu Zhang, Zongwei Li, Wancai Zhang, Zhifeng Li, Wei Liu, and Li Yuan

TL;DR
LanguageBind extends video-language pretraining to N modalities by using language as a semantic anchor: each additional modality is aligned to a shared, language-centered feature space, and the authors report improved performance across 15 benchmarks.
Contribution
The paper proposes a new method to extend video-language pretraining to multiple modalities using language as a semantic bridge, along with a large aligned dataset VIDAL-10M.
Findings
Achieved superior performance on 15 diverse benchmarks.
Effectively aligned multiple modalities to language in a shared feature space.
Demonstrated the effectiveness of indirect alignment and modality complementarity.
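The findings above can be illustrated with a toy sketch. If every modality is aligned to language, a query from any modality can be scored against text label embeddings, and two modalities that were never paired directly still land near each other (indirect alignment). The embeddings below are made-up numbers, and `zero_shot_scores` and `fuse_scores` are hypothetical helper names; averaging scores across modalities is one simple assumption about how complementarity might be exploited, not necessarily the paper's procedure.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_scores(query_emb, label_text_embs):
    """Score a modality embedding against each label's text embedding."""
    return {label: cosine(query_emb, emb) for label, emb in label_text_embs.items()}


def fuse_scores(*score_dicts):
    """Average per-label scores from several modalities (an assumed fusion rule)."""
    labels = score_dicts[0].keys()
    return {l: sum(s[l] for s in score_dicts) / len(score_dicts) for l in labels}


# Toy embeddings in a shared 3-d space (illustrative values only).
label_text_embs = {"dog": [1.0, 0.0, 0.0], "car": [0.0, 1.0, 0.0]}
video_emb = [0.6, 0.5, 0.1]  # ambiguous on its own
audio_emb = [0.9, 0.1, 0.2]  # clearly closer to "dog"

video_scores = zero_shot_scores(video_emb, label_text_embs)
audio_scores = zero_shot_scores(audio_emb, label_text_embs)
fused = fuse_scores(video_scores, audio_scores)
prediction = max(fused, key=fused.get)

# Indirect alignment: video and audio were only ever compared to text,
# yet they can be compared to each other in the same space.
video_audio_similarity = cosine(video_emb, audio_emb)
```

In this toy case the fused margin between "dog" and "car" is wider than the video-only margin, which is the intuition behind modality complementarity.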
Abstract
Video-language (VL) pretraining has achieved remarkable improvements in multiple downstream tasks. However, the current VL pretraining framework is hard to extend to multiple modalities (N modalities, N>=3) beyond vision and language. We thus propose LanguageBind, which takes language as the bind across different modalities because the language modality is well explored and contains rich semantics. Specifically, we freeze the language encoder acquired by VL pretraining, then train encoders for the other modalities with contrastive learning. As a result, all modalities are mapped to a shared feature space, implementing multi-modal semantic alignment. While LanguageBind ensures that we can extend VL modalities to N modalities, we also need a high-quality dataset with alignment data pairs centered on language. We thus propose VIDAL-10M with Video, Infrared, Depth, Audio and their…
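The training recipe in the abstract (a frozen language encoder, with the other modality encoders trained contrastively against it) can be sketched with a symmetric InfoNCE-style loss. This is a minimal illustration under assumptions rather than the authors' implementation: `contrastive_loss` is a hypothetical name, the temperature of 0.07 is a commonly used default and not a value taken from the paper, and the vectors stand in for encoder outputs. In a real setup the text embeddings would come from the frozen language encoder, so only the modality encoder would receive gradients.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def contrastive_loss(modality_embs, text_embs, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired (modality, text) embeddings.

    Row i of each input is assumed to describe the same sample; every other
    row in the batch serves as a negative. The temperature is an assumption.
    """
    m = [l2_normalize(v) for v in modality_embs]
    t = [l2_normalize(v) for v in text_embs]
    n = len(m)
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(m[i], t[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Modality -> text direction: the matching text sits on the diagonal.
    loss_m2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text -> modality direction: same computation on the transposed matrix.
    logits_t = [list(col) for col in zip(*logits)]
    loss_t2m = -sum(log_softmax(logits_t[i])[i] for i in range(n)) / n
    return 0.5 * (loss_m2t + loss_t2m)


# Toy batch: text embeddings play the role of frozen language-encoder outputs.
text_embs = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]     # each sample near its own caption
misaligned = [[0.1, 0.9], [0.9, 0.1]]  # each sample near the wrong caption

loss_aligned = contrastive_loss(aligned, text_embs)
loss_misaligned = contrastive_loss(misaligned, text_embs)
```

The loss is lower when each modality embedding sits closest to its own caption, which is the signal that pulls a new modality into the language-anchored space.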
Code & Models
- 🤗 LanguageBind/LanguageBind_Thermal · model · 268 dl · ♡ 1
- 🤗 LanguageBind/LanguageBind_Audio · model · 293 dl · ♡ 3
- 🤗 LanguageBind/LanguageBind_Video · model · 5.6k dl · ♡ 3
- 🤗 LanguageBind/LanguageBind_Depth · model · 242 dl
- 🤗 LanguageBind/LanguageBind_Image · model · 28k dl · ♡ 11
- 🤗 LanguageBind/Video-LLaVA-Pretrain-7B · model · 19 dl · ♡ 10
- 🤗 LanguageBind/Video-LLaVA-7B · model · 8.0k dl · ♡ 89
- 🤗 LanguageBind/LanguageBind_Video_merge · model · 20k dl · ♡ 5
- 🤗 LanguageBind/LanguageBind_Video_FT · model · 5.7k dl · ♡ 7
- 🤗 LanguageBind/LanguageBind_Audio_FT · model · 1.7k dl · ♡ 2
Taxonomy
Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Cancer-related molecular mechanisms research
