Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder

Dang Jisheng (1; 2); Wu Xudong (3); Wang Bimei (4; 2); Lv Ning (1); Chen Jiayu (1); Jingwen Zhao (3); Yichu Liu (5); Jizhao Liu (1); Juncheng Li (6); Teng Wang (7) ((1) Lanzhou University; (2) National University of Singapore; (3) Sun Yat-sen University; (4) Jinan University; (5) South China University of Technology; (6) Zhejiang University; (7) The University of Hong Kong)

arXiv:2506.22880 · cs.CV · July 1, 2025

TL;DR

This paper introduces DeSa2VA, a decoupling-enhanced prompting scheme that improves video segmenter and grounder performance by disentangling visual and semantic features, achieving state-of-the-art results across multiple tasks.

Contribution

The paper proposes a novel decoupling and prompting framework that enhances semantic grounding and feature disentanglement in video segmentation and grounding models.

Findings

01 Achieves state-of-the-art results on image and video segmentation tasks.

02 Effectively disentangles visual and semantic features for better reasoning.

03 Improves performance in image/video question answering tasks.

Abstract

Existing video segmenter and grounder approaches, exemplified by Sa2VA, directly fuse features within segmentation models. This often results in an undesirable entanglement of dynamic visual information and static semantics, thereby degrading segmentation accuracy. To systematically mitigate this issue, we propose DeSa2VA, a decoupling-enhanced prompting scheme that integrates text pre-training and a linear decoupling module to address the information processing limitations inherent in SAM-2. First, we devise a pre-training paradigm that converts textual ground-truth labels into point-level prompts while generating corresponding text masks. These masks are refined through a hybrid loss function to strengthen the model's semantic grounding capabilities. Next, we employ linear projection to disentangle hidden states generated by a large language model into distinct textual…
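The abstract is cut off before it details the decoupling module, so the following is only a minimal sketch of the general idea it names: applying separate linear projections to one hidden state to obtain a textual feature and a visual feature. The function names, dimensions, and random weights are all hypothetical and are not taken from the DeSa2VA code.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Return a random (out_dim x in_dim) weight matrix as nested lists."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(weight, hidden):
    """Apply a bias-free linear projection, i.e. weight @ hidden."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]


def decouple(hidden, w_text, w_visual):
    """Map one hidden state to two features via two independent linear maps.

    Assumption: one projection targets a textual (semantic) subspace and
    the other a visual subspace; the paper's actual design may differ.
    """
    return project(w_text, hidden), project(w_visual, hidden)


rng = random.Random(0)
HIDDEN_DIM, TEXT_DIM, VISUAL_DIM = 8, 4, 4  # hypothetical sizes

w_text = make_linear(HIDDEN_DIM, TEXT_DIM, rng)
w_visual = make_linear(HIDDEN_DIM, VISUAL_DIM, rng)

# Stand-in for the hidden state of a segmentation token from the language model.
seg_hidden = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)]

text_feat, visual_feat = decouple(seg_hidden, w_text, w_visual)
print(len(text_feat), len(visual_feat))
```

In a trained model the two weight matrices would be learned, and each output would presumably be routed to a different prompt input of the segmentation model; here they are random so the sketch stays self-contained.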

Code & Models

Repositories

longmalongma/desa2va (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis