ReVSeg: Incentivizing the Reasoning Chain for Video Segmentation with Reinforcement Learning
Yifan Li, Yingda Yin, Lingting Zhu, Weikai Chen, Shengju Qian, Xin Wang, Yanwei Fu

TL;DR
ReVSeg is a reinforcement learning-based approach to reasoning-centric video object segmentation. It explicitly decomposes reasoning into semantics interpretation, temporal evidence selection, and spatial grounding, reporting state-of-the-art results and producing interpretable reasoning chains.
Contribution
The paper presents ReVSeg, a novel method that explicitly models reasoning steps in video segmentation using pretrained vision-language models and reinforcement learning.
Findings
Achieves state-of-the-art performance on video segmentation benchmarks.
Provides interpretable reasoning trajectories.
Self-refines decision quality through reinforcement learning.
Abstract
Reasoning-centric video object segmentation is an inherently complex task: the query often refers to dynamics, causality, and temporal interactions, rather than static appearances. Yet existing solutions generally collapse these factors into simplified reasoning with latent embeddings, rendering the reasoning chain opaque and essentially intractable. We therefore adopt an explicit decomposition perspective and introduce ReVSeg, which executes reasoning as sequential decisions in the native interface of pretrained vision language models (VLMs). Rather than folding all reasoning into a single-step prediction, ReVSeg executes three explicit operations -- semantics interpretation, temporal evidence selection, and spatial grounding -- aligning pretrained capabilities. We further employ reinforcement learning to optimize the multi-step reasoning chain, enabling the model to self-refine its…
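The three operations named in the abstract can be pictured as a small sequential pipeline that records each intermediate decision. What follows is only a minimal sketch under stated assumptions, not the authors' implementation: `run_chain`, `Step`, the toy frame format, and the three `toy_*` functions are hypothetical stand-ins for calls to a VLM, and the reinforcement learning stage is not modeled at all.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]   # hypothetical (x1, y1, x2, y2) box format
Frame = Dict[str, Box]            # toy frame: object label -> bounding box


@dataclass
class Step:
    name: str     # which operation produced this result
    output: Any   # the intermediate decision, kept for inspection


def run_chain(
    query: str,
    frames: Sequence[Frame],
    interpret: Callable[[str], str],
    select_frame: Callable[[str, Sequence[Frame]], int],
    ground: Callable[[str, Frame], Box],
) -> List[Step]:
    """Run the three operations in order and keep every intermediate result."""
    target = interpret(query)             # semantics interpretation
    index = select_frame(target, frames)  # temporal evidence selection
    box = ground(target, frames[index])   # spatial grounding
    return [
        Step("semantics", target),
        Step("temporal", index),
        Step("spatial", box),
    ]


# Toy stand-ins (assumptions): a real system would query a VLM at each step.
def toy_interpret(query: str) -> str:
    return {"the object the dog chases": "ball"}[query]


def toy_select(target: str, frames: Sequence[Frame]) -> int:
    # Pick the first frame in which the target is visible.
    return next(i for i, frame in enumerate(frames) if target in frame)


def toy_ground(target: str, frame: Frame) -> Box:
    return frame[target]


frames = [
    {"dog": (0, 0, 10, 10)},
    {"dog": (1, 1, 11, 11), "ball": (5, 5, 8, 8)},
]
chain = run_chain("the object the dog chases", frames,
                  toy_interpret, toy_select, toy_ground)
for step in chain:
    print(step.name, step.output)
```

Keeping each step's output in the returned list is what makes the chain inspectable: a wrong final box can be traced to a wrong target, a wrong frame, or a wrong localization.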
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Explainable Artificial Intelligence (XAI)
