Video-LLaVA: Learning United Visual Representation by Alignment Before Projection
Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, Li Yuan

TL;DR
Video-LLaVA introduces a unified visual representation approach for large vision-language models, enabling better multi-modal understanding and outperforming existing models on various image and video benchmarks.
Contribution
The paper proposes a unified visual representation for images and videos in LVLMs, improving multi-modal learning and performance across multiple benchmarks.
Findings
Outperforms Video-ChatGPT on several video benchmarks
Achieves superior results on 9 image benchmarks
Mutually benefits image and video understanding within a unified model
Abstract
The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA achieves superior performances on a broad range of 9 image benchmarks across 5 image…
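The core idea in the abstract, aligning image and video features into one space before projecting them into the language model, can be illustrated with a toy sketch. This is not the authors' implementation: the dimensions, the random stand-in features, and the names `make_linear`, `project`, and `shared_proj` are all hypothetical. It only shows that once both modalities share a feature space, a single projection layer can serve both, rather than one projection per modality.

```python
import random

VISUAL_DIM = 8   # hypothetical size of the shared (pre-aligned) visual feature space
LLM_DIM = 16     # hypothetical size of the language model's embedding space


def make_linear(in_dim, out_dim, seed=0):
    """Build a random in_dim x out_dim weight matrix (stand-in for a trained layer)."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]


def project(tokens, weight):
    """Map each token vector of length in_dim to length out_dim (tokens @ weight)."""
    out_dim = len(weight[0])
    return [
        [sum(tok[i] * weight[i][j] for i in range(len(tok))) for j in range(out_dim)]
        for tok in tokens
    ]


# One projection shared by both modalities. This is valid only because the
# image and video features are assumed to already live in the same space.
shared_proj = make_linear(VISUAL_DIM, LLM_DIM)

rng = random.Random(1)
# Stand-ins for encoder outputs: 1 image with 4 patch tokens,
# and a video with 2 frames of 4 patch tokens each.
image_tokens = [[rng.random() for _ in range(VISUAL_DIM)] for _ in range(4)]
video_tokens = [[rng.random() for _ in range(VISUAL_DIM)] for _ in range(2 * 4)]

image_embeds = project(image_tokens, shared_proj)
video_embeds = project(video_tokens, shared_proj)

# Stand-in for 3 text token embeddings produced by the language model's embedder.
text_embeds = [[0.0] * LLM_DIM for _ in range(3)]

# The language model would receive one sequence mixing visual and text tokens.
llm_input = image_embeds + video_embeds + text_embeds
print(len(llm_input), len(llm_input[0]))
```

In the misaligned setting the abstract criticizes, `image_tokens` and `video_tokens` would come from separate feature spaces and each would need its own projection layer.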
Code & Models
- 🤗 LanguageBind/Video-LLaVA-Pretrain-7B (model): 19 downloads, 10 likes
- 🤗 LanguageBind/Video-LLaVA-7B (model): 8.0k downloads, 89 likes
- 🤗 pandalla/MBTIGPT_zh_INFP (model): 4 downloads
- 🤗 pandalla/MBTIGPT_zh_INTP (model): 7 downloads, 6 likes
- 🤗 pandalla/MBTIGPT_zh_ENFJ (model): 1 download
- 🤗 pandalla/MBTIGPT_zh_ENFP (model): 2 downloads
- 🤗 pandalla/MBTIGPT_zh_ENTJ (model): 2 downloads
- 🤗 pandalla/MBTIGPT_zh_ENTP (model): 2 downloads
- 🤗 pandalla/MBTIGPT_zh_ESFJ (model): 3 downloads
- 🤗 pandalla/MBTIGPT_zh_ESFP (model): 2 downloads, 1 like
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques
