VideoSeg-R1: Reasoning Video Object Segmentation via Reinforcement Learning

Zishan Xu; Yifu Guo; Yuquan Lu; Fengyu Yang; Junxin Li

arXiv:2511.16077 · cs.CV · November 21, 2025

TL;DR

VideoSeg-R1 introduces a reinforcement learning framework for video object segmentation that enhances generalization and explicit reasoning, outperforming existing methods on multiple benchmarks.

Contribution

VideoSeg-R1 is the first framework to incorporate reinforcement learning into video reasoning segmentation, using a decoupled architecture and adaptive control of reasoning length.

Findings

01 Achieves state-of-the-art results on multiple benchmarks.

02 Effectively models explicit reasoning chains.

03 Improves generalization to out-of-distribution scenarios.

Abstract

Traditional video reasoning segmentation methods rely on supervised fine-tuning, which limits generalization to out-of-distribution scenarios and provides no explicit reasoning. To address this, we propose VideoSeg-R1, the first framework to introduce reinforcement learning into video reasoning segmentation. It adopts a decoupled architecture that formulates the task as joint referring image segmentation and video mask propagation. It comprises three stages: (1) a hierarchical text-guided frame sampler that emulates human attention; (2) a reasoning model that produces spatial cues along with explicit reasoning chains; and (3) a segmentation-propagation stage using SAM2 and XMem. A task difficulty-aware mechanism adaptively controls reasoning length for better efficiency and accuracy. Extensive evaluations on multiple benchmarks demonstrate that VideoSeg-R1 achieves state-of-the-art…
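The decoupled three-stage design described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: every name here (`run_pipeline`, `ReasoningOutput`, the `toy_*` stand-ins, and the bounding-box cue format) is a hypothetical placeholder, and the real sampler, reasoning model, SAM2 segmenter, and XMem propagator are replaced by trivial functions so that only the data flow between stages is shown.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

# Hypothetical spatial-cue format: (x0, y0, x1, y1) in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class ReasoningOutput:
    reasoning: str  # explicit reasoning chain, as text
    keyframe: int   # index into the video of the frame the cue refers to
    box: Box        # spatial cue locating the target object


def run_pipeline(
    frames: Sequence,
    query: str,
    sample_frames: Callable[[Sequence, str], List[int]],
    reason: Callable[[Sequence, List[int], str], ReasoningOutput],
    segment: Callable[[object, Box], object],
    propagate: Callable[[Sequence, int, object], List[object]],
):
    """Chain the three stages; each stage is passed in as a callable."""
    # Stage 1: text-guided frame sampling picks candidate frame indices.
    indices = sample_frames(frames, query)
    # Stage 2: the reasoning model returns a reasoning chain and a spatial cue.
    out = reason(frames, indices, query)
    # Stage 3: segment the keyframe from the cue, then propagate the mask.
    key_mask = segment(frames[out.keyframe], out.box)
    masks = propagate(frames, out.keyframe, key_mask)
    return out.reasoning, masks


# Toy stand-ins (assumptions for illustration only, not real models).
def toy_sampler(frames, query):
    return list(range(0, len(frames), 2))  # every second frame


def toy_reasoner(frames, indices, query):
    return ReasoningOutput(f"look for: {query}", indices[0], (0, 0, 10, 10))


def toy_segmenter(frame, box):
    return {"frame": frame, "box": box}  # placeholder "mask"


def toy_propagator(frames, keyframe, key_mask):
    return [key_mask for _ in frames]  # copy the keyframe mask to all frames


if __name__ == "__main__":
    video = [f"frame{i}" for i in range(6)]
    chain, masks = run_pipeline(
        video, "the dog", toy_sampler, toy_reasoner, toy_segmenter, toy_propagator
    )
    print(chain, len(masks))
```

Passing each stage as a callable mirrors the decoupling the abstract emphasizes: under this assumed interface, the reasoning model could be trained or swapped independently of the segmentation and propagation components.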

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Human Pose and Action Recognition