Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin; Yang Ye; Bin Zhu; Jiaxi Cui; Munan Ning; Peng Jin; Li Yuan

arXiv:2311.10122 · cs.CV · October 2, 2024 · 20 cites

Open Access 5 Repos 10 Models 1 Datasets 1 Video

TL;DR

Video-LLaVA aligns image and video features into a single visual representation before projecting them into the language model's feature space. The resulting model outperforms Video-ChatGPT on several video benchmarks and achieves superior results on 9 image benchmarks.

Contribution

The paper proposes a unified visual representation for images and videos in large vision-language models (LVLMs), aligned before projection into the LLM, so that a single model trained on a mixed image and video dataset improves on both modalities.

Findings

01 Outperforms Video-ChatGPT on several video benchmarks

02 Achieves superior results on 9 image question-answering datasets

03 Mutually benefits image and video understanding within a unified model

Abstract

The Large Vision-Language Model (LVLM) has enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA achieves superior performances on a broad range of 9 image benchmarks across 5 image…
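The idea of feeding already-aligned image and video features through one projection, rather than through separate per-modality projections, can be sketched as follows. This is a minimal illustration only, not the authors' implementation: the encoder outputs are random stand-ins, and the names `VISUAL_DIM`, `LLM_DIM`, `make_projection`, `project`, and `fake_encoder_output` are hypothetical, as are all sizes.

```python
import random

VISUAL_DIM = 8   # hypothetical size of the shared (pre-aligned) visual feature space
LLM_DIM = 16     # hypothetical size of the language model's embedding space


def make_projection(in_dim, out_dim, seed=0):
    """Random in_dim x out_dim matrix, standing in for a learned projection layer."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(out_dim)] for _ in range(in_dim)]


def project(tokens, weights):
    """Map each visual token (length in_dim) to a vector of length out_dim."""
    out_dim = len(weights[0])
    return [
        [sum(tok[i] * weights[i][j] for i in range(len(tok))) for j in range(out_dim)]
        for tok in tokens
    ]


def fake_encoder_output(num_tokens, dim, seed):
    """Random features standing in for an encoder whose outputs are assumed aligned."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_tokens)]


# One projection shared by both modalities (the assumption this sketch illustrates).
shared_projection = make_projection(VISUAL_DIM, LLM_DIM)

image_tokens = fake_encoder_output(num_tokens=4, dim=VISUAL_DIM, seed=1)      # one image
video_tokens = fake_encoder_output(num_tokens=3 * 4, dim=VISUAL_DIM, seed=2)  # 3 frames x 4 tokens

image_embeds = project(image_tokens, shared_projection)
video_embeds = project(video_tokens, shared_projection)

# Both modalities now live in the same space and can be concatenated
# into a single visual token sequence for the language model.
visual_sequence = image_embeds + video_embeds
print(len(visual_sequence), len(visual_sequence[0]))
```

Because both inputs share `VISUAL_DIM`, a single weight matrix serves images and videos; with separate, unaligned feature spaces each modality would need its own projection.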

Code & Models

Datasets

LanguageBind/MoE-LLaVA
dataset · 420 downloads

Videos

Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Image and Video Retrieval Techniques