Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos

Haobo Yuan; Xiangtai Li; Tao Zhang; Yueyi Sun; Zilong Huang; Shilin Xu; Shunping Ji; Yunhai Tong; Lu Qi; Jiashi Feng; Ming-Hsuan Yang

arXiv:2501.04001·cs.CV·November 4, 2025·2 cites

Open Access · 1 Repo · 10 Models · 3 Datasets

TL;DR

Sa2VA is a unified multi-modal model that integrates SAM2 and LLaVA to enable dense, grounded understanding of images and videos, supporting diverse tasks with minimal instruction tuning.

Contribution

It introduces Sa2VA, the first model to unify static and dynamic visual understanding using a shared LLM token space, combining foundation video segmentation with vision-language modeling.

Findings

01 Achieves strong performance in referring video object segmentation

02 Supports a wide range of image and video tasks with minimal tuning

03 Provides a new dataset, Ref-SAV, with over 72k object expressions

Abstract

This work presents Sa2VA, the first comprehensive, unified model for dense grounded understanding of both images and videos. Unlike existing multi-modal large language models, which are often limited to specific modalities and tasks, Sa2VA supports a wide range of image and video tasks, including referring segmentation and conversation, with minimal one-shot instruction tuning. Sa2VA combines SAM-2, a foundation video segmentation model, with an MLLM, an advanced vision-language model, and unifies text, image, and video into a shared LLM token space. Using the LLM, Sa2VA generates instruction tokens that guide SAM-2 in producing precise masks, enabling grounded, multi-modal understanding of both static and dynamic visual content. Additionally, we introduce Ref-SAV, an auto-labeled dataset containing over 72k object expressions in complex video scenes, designed to boost model…
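
The abstract's central mechanism, an LLM-produced instruction token steering a mask decoder, can be illustrated with a toy sketch. This is not the Sa2VA or SAM-2 implementation: the real mask decoder is a learned network with video memory, whereas here a plain dot product stands in for it. The function names (`project`, `decode_mask`, `segment_video`), the vector sizes, and the threshold are all hypothetical choices made for illustration. The one idea the sketch is meant to carry is that a single hidden vector from the LLM is projected into a prompt and then reused on every frame, so an image is simply a one-frame video.

```python
import math

def project(hidden, weight):
    """Linear map from the (assumed) LLM hidden size to the decoder's prompt size."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]

def decode_mask(prompt, pixel_features, threshold=0.5):
    """Toy stand-in for a mask decoder: score each pixel feature against the prompt."""
    mask = []
    for row in pixel_features:
        mask_row = []
        for feat in row:
            score = sum(p * f for p, f in zip(prompt, feat))
            prob = 1.0 / (1.0 + math.exp(-score))  # sigmoid
            mask_row.append(1 if prob > threshold else 0)
        mask.append(mask_row)
    return mask

def segment_video(instruction_hidden, weight, frames):
    """Reuse one prompt, derived from one instruction token, on every frame."""
    prompt = project(instruction_hidden, weight)
    return [decode_mask(prompt, frame) for frame in frames]

# Hypothetical 3-dim hidden state of the instruction token emitted by the LLM.
instruction_hidden = [1.0, -1.0, 0.5]
# Hypothetical projection from 3 dims to a 2-dim prompt.
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]
# One 2x2 frame of 2-dim pixel features; the "video" repeats it twice.
frame = [[[2.0, 0.0], [0.0, 2.0]],
         [[-1.0, 0.0], [1.0, -1.0]]]

masks = segment_video(instruction_hidden, weight, [frame, frame])
print(masks)
```

Passing a single-element list of frames gives the image case without any change to the code path, which mirrors the unification of static and dynamic content described above.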

Code & Models

Repositories

magic-research/Sa2VA (PyTorch, official)

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · COVID-19 diagnosis using AI