Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Haobo Yuan, Xiangtai Li, Tao Zhang, Yueyi Sun, Zilong Huang, Shilin Xu, Shunping Ji, Yunhai Tong, Lu Qi, Jiashi Feng, Ming-Hsuan Yang

TL;DR
Sa2VA is a unified multi-modal model that integrates SAM2 and LLaVA to enable dense, grounded understanding of images and videos, supporting diverse tasks with minimal instruction tuning.
Contribution
The paper introduces Sa2VA, which it describes as the first model to unify image and video understanding in a shared LLM token space, combining SAM-2, a foundation video segmentation model, with a vision-language model.
Findings
- Achieves strong performance in referring video object segmentation
- Supports a wide range of image and video tasks with minimal tuning
- Provides a new dataset, Ref-SAV, with over 72k object expressions
Abstract
This work presents Sa2VA, the first comprehensive, unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with MLLM, the advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling a grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model…
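The flow described in the abstract, where the LLM emits an instruction token that guides the mask model, can be pictured with a small toy sketch. Everything below is an illustrative assumption rather than the authors' implementation: the function names, the dimensions, the random linear projection, and the dot-product "decoder" are hypothetical stand-ins for the LLM hidden state, the projection into prompt space, and SAM-2's learned mask decoder.

```python
import random

random.seed(0)


def project(hidden, weights):
    """Map an LLM hidden state to a prompt embedding (hypothetical linear layer)."""
    return [sum(h * w for h, w in zip(hidden, row)) for row in weights]


def decode_mask(prompt, features, threshold=0.0):
    """Toy stand-in for a promptable mask decoder.

    Scores each pixel by the dot product of its feature vector with the
    prompt embedding and thresholds the score to get a binary mask.
    """
    return [
        [sum(p * f for p, f in zip(prompt, pixel)) > threshold for pixel in row]
        for row in features
    ]


def rand_vec(n):
    return [random.gauss(0.0, 1.0) for _ in range(n)]


# Assumed toy sizes; the real model dimensions are not given on this page.
LLM_DIM, PROMPT_DIM, HEIGHT, WIDTH, NUM_FRAMES = 16, 8, 4, 6, 3

# Hidden state of the instruction token produced by the LLM (random here).
hidden_state = rand_vec(LLM_DIM)
projection = [rand_vec(LLM_DIM) for _ in range(PROMPT_DIM)]

# Per-frame pixel features, as a vision encoder might produce (random here).
frames = [
    [[rand_vec(PROMPT_DIM) for _ in range(WIDTH)] for _ in range(HEIGHT)]
    for _ in range(NUM_FRAMES)
]

# One prompt embedding is reused for every frame, so the same instruction
# yields one mask per image or per video frame.
prompt = project(hidden_state, projection)
masks = [decode_mask(prompt, frame) for frame in frames]
```

The point of the sketch is the interface: a single embedding derived from the language model conditions the mask prediction for every frame, which is why one token space can serve both static images (one frame) and videos (many frames).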
Code & Models
- 🤗 ByteDance/Sa2VA-4B (model) · 4.9k dl · ♡ 96
- 🤗 ByteDance/Sa2VA-8B (model) · 951 dl · ♡ 65
- 🤗 ByteDance/Sa2VA-1B (model) · 660 dl · ♡ 29
- 🤗 ByteDance/Sa2VA-26B (model) · 114 dl · ♡ 31
- 🤗 kumuji/Sa2VA-i-4B (model) · 7 dl
- 🤗 kumuji/Sa2VA-i-1B (model) · 30 dl
- 🤗 kumuji/Sa2VA-i-8B (model) · 2 dl
- 🤗 ByteDance/Sa2VA-InternVL3-2B (model) · 246 dl · ♡ 1
- 🤗 ByteDance/Sa2VA-InternVL3-8B (model) · 77 dl · ♡ 4
- 🤗 ByteDance/Sa2VA-InternVL3-14B (model) · 43 dl · ♡ 9
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · COVID-19 diagnosis using AI
