Reinforcing Video Reasoning Segmentation to Think Before It Segments
Sitong Gong, Lu Zhang, Yunzhi Zhuge, Xu Jia, Pingping Zhang, Huchuan Lu

TL;DR
This paper introduces Veason-R1, a reinforcement learning-based LVLM for video reasoning segmentation that improves interpretability and spatiotemporal reasoning and achieves state-of-the-art results on multiple benchmarks.
Contribution
The paper presents Veason-R1, a novel LVLM for video reasoning segmentation (VRS) that instills structured reasoning through Chain-of-Thought (CoT) initialization followed by Group Relative Policy Optimization (GRPO), improving both performance and interpretability.
Findings
Improves on prior methods by +1.3 J&F on ReVOS and +10.0 J&F on ReasonVOS.
Shows greater robustness to hallucinations, with a +8.8 gain in the robustness score R.
Sets a new state of the art on multiple video reasoning segmentation benchmarks.
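For readers unfamiliar with the J&F metric quoted above, the sketch below shows how it is conventionally assembled in video object segmentation benchmarks: J is the region similarity (intersection over union of predicted and ground-truth masks), F is a boundary F-measure, and J&F is their mean. The function names and the nested-list mask format are illustrative assumptions, not taken from the paper, and the boundary score F is passed in as a given number rather than computed.

```python
def region_similarity(pred, gt):
    """J: intersection over union of two binary masks (nested lists of 0/1)."""
    inter = 0
    union = 0
    for pred_row, gt_row in zip(pred, gt):
        for p, g in zip(pred_row, gt_row):
            if p and g:
                inter += 1
            if p or g:
                union += 1
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if union == 0 else inter / union


def j_and_f(j, f):
    """J&F: arithmetic mean of region similarity J and boundary accuracy F."""
    return (j + f) / 2


# Hypothetical 2x2 masks for illustration only.
pred_mask = [[1, 1], [0, 0]]
gt_mask = [[1, 0], [0, 0]]
j_score = region_similarity(pred_mask, gt_mask)
```

In practice J and F are computed per frame and per object and then averaged over the dataset, so a "+1.3 J&F" gain refers to a difference in that dataset-level mean.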
Abstract
Video reasoning segmentation (VRS) endeavors to delineate referred objects in videos guided by implicit instructions that encapsulate human intent and temporal logic. Previous approaches leverage large vision language models (LVLMs) to encode object semantics into <SEG> tokens for mask prediction. However, this paradigm suffers from limited interpretability during inference and suboptimal performance due to inadequate spatiotemporal reasoning. Drawing inspiration from seminal breakthroughs in reinforcement learning, we introduce Veason-R1, a specialized LVLM for VRS that emphasizes structured reasoning in segmentation. Veason-R1 is trained through Group Relative Policy Optimization (GRPO) augmented with Chain-of-Thought (CoT) initialization. To begin with, we curate high-quality CoT training data to instill structured reasoning trajectories, bridging video-level semantics and…
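The abstract names GRPO as the training algorithm. As a rough illustration of its central idea, the sketch below computes group-relative advantages: several responses are sampled for the same prompt, each receives a scalar reward, and each reward is normalized against the mean and standard deviation of its own group instead of a learned value function. This is a minimal sketch of the general GRPO normalization step, not the authors' implementation; the function name, the `eps` term, the use of the population standard deviation, and the example rewards are all assumptions.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the statistics of its own sampling group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption, implementations vary
    # eps (assumed) avoids division by zero when all rewards in the group match.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled responses to one video query.
example_rewards = [1.0, 0.0, 1.0, 0.0]
advantages = group_relative_advantages(example_rewards)
```

Responses that score above their group's mean receive positive advantages and are reinforced, while below-average responses are discouraged. When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.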
Taxonomy
Topics: Multimodal Machine Learning Applications · Reinforcement Learning in Robotics · Explainable Artificial Intelligence (XAI)
