Decoupled Seg Tokens Make Stronger Reasoning Video Segmenter and Grounder
Dang Jisheng (1, 2), Wu Xudong (3), Wang Bimei (4, 2), Lv Ning (1), Chen Jiayu (1), Jingwen Zhao (3), Yichu Liu (5), Jizhao Liu (1), Juncheng Li (6), Teng Wang (7) ((1) Lanzhou University, (2) National University of Singapore, (3) Sun Yat-sen University, (4) Jinan University

TL;DR
This paper introduces DeSa2VA, a decoupling-enhanced prompting scheme that improves video segmenter and grounder performance by disentangling visual and semantic features, achieving state-of-the-art results across multiple tasks.
Contribution
The paper proposes a novel decoupling and prompting framework that enhances semantic grounding and feature disentanglement in video segmentation and grounding models.
Findings
Achieves state-of-the-art results on image and video segmentation tasks.
Effectively disentangles visual and semantic features for better reasoning.
Improves performance in image/video question answering tasks.
Abstract
Existing video segmenter and grounder approaches, exemplified by Sa2VA, directly fuse features within segmentation models. This often results in an undesirable entanglement of dynamic visual information and static semantics, thereby degrading segmentation accuracy. To systematically mitigate this issue, we propose DeSa2VA, a decoupling-enhanced prompting scheme integrating text pre-training and a linear decoupling module to address the information processing limitations inherent in SAM-2. Specifically, we first devise a pre-training paradigm that converts textual ground-truth labels into point-level prompts while generating corresponding text masks. These masks are refined through a hybrid loss function to strengthen the model's semantic grounding capabilities. Next, we employ linear projection to disentangle hidden states generated by a large language model into distinct textual…
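The linear decoupling step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the two-head layout (one textual, one visual), the dimensions, the random initialisation, and all function names below are assumptions chosen for illustration. The sketch only shows the general idea of mapping one language-model hidden state through separate linear projections so that each downstream consumer receives its own feature vector.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Build a randomly initialised linear layer as (weights, bias).

    Hypothetical helper: a trained model would learn these parameters.
    """
    scale = 1.0 / in_dim ** 0.5
    weights = [[rng.uniform(-scale, scale) for _ in range(in_dim)]
               for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def apply_linear(layer, x):
    """Compute W @ x + b for a single input vector x."""
    weights, bias = layer
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def decouple(hidden, text_head, visual_head):
    """Project one hidden state into two separate feature vectors.

    Assumption: one projection yields a textual (semantic) feature and
    the other a visual feature; the source does not specify the heads.
    """
    return apply_linear(text_head, hidden), apply_linear(visual_head, hidden)


# Illustrative sizes only; real hidden states are far larger.
HIDDEN_DIM, TEXT_DIM, VISUAL_DIM = 8, 4, 4

rng = random.Random(0)
text_head = make_linear(HIDDEN_DIM, TEXT_DIM, rng)
visual_head = make_linear(HIDDEN_DIM, VISUAL_DIM, rng)

# Stand-in for the hidden state of a segmentation token from the LLM.
seg_hidden = [rng.gauss(0.0, 1.0) for _ in range(HIDDEN_DIM)]

text_feat, visual_feat = decouple(seg_hidden, text_head, visual_head)
print(len(text_feat), len(visual_feat))
```

Because each head is a plain linear map, the two outputs are separate views of the same hidden state and can be supervised with different objectives. How DeSa2VA trains and consumes these features is not covered by the truncated abstract.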
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Generative Adversarial Networks and Image Synthesis
