RCoT-Seg: Reinforced Chain-of-Thought for Video Reasoning and Segmentation

Junwei Wen; Deshui Miao; Guangming Lu; Xin Li; Wenjie Pei

arXiv:2605.07334·cs.CV·May 11, 2026

TL;DR

RCoT-Seg is a video reasoning segmentation framework that explicitly separates temporal reasoning from spatial perception, with the aim of improving accuracy and robustness in complex multi-object video scenes.

Contribution

The paper proposes a reinforced chain-of-thought approach that combines a keyframe selection module with a two-stage segmentation process, aiming to strengthen holistic temporal understanding and spatial precision.

Findings

01. Achieves state-of-the-art performance on video reasoning and segmentation benchmarks.

02. Improves moment localization and inter-frame mask consistency.

03. Effectively separates temporal reasoning from spatial perception for better results.

Abstract

Video Reasoning Segmentation (VRS) aims to segment target objects in videos based on implicit instructions that convey human intent and temporal logic. Existing MLLM-based methods predict masks with a [SEG] token after selecting frames via simple sampling or an auxiliary MLLM, where limited supervision and frame-language similarity rules often yield narrow-scope keyframe choices that weaken holistic temporal understanding and lead to brittle localization in complex multi-object scenes. To address these issues, we introduce RCoT-Seg, a video-of-thought framework that factorizes VRS into temporal video reasoning (TVR) and keyframe target perception (KTP), explicitly separating temporal reasoning from spatial perception. Specifically, in the TVR stage, an agentic keyframe selection module, initialized with a curated CoT-start corpus and refined by GRPO under task-aligned rewards, is…
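The abstract says the keyframe selection module is refined by GRPO under task-aligned rewards, but the excerpt does not spell out the rewards or the training setup. As a rough illustration only, the sketch below shows the group-relative advantage normalization commonly associated with GRPO: each sampled rollout is scored against the mean and spread of its own group. The function name, the reward values, the use of population standard deviation, and the `eps` term are all assumptions made for this example, not details from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the other rollouts in its group.

    Assumption: population standard deviation plus a small eps for stability;
    the paper's actual normalization is not given in this excerpt.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled keyframe-selection rollouts
# answering the same video/instruction pair (values are made up).
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)

for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:.2f} advantage={advantage:+.3f}")
```

Rollouts that score above their group average receive a positive advantage and are reinforced, while those below it are discouraged, which is why no separate value network is needed in this style of training.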

Code & Models

Repositories

Victor-wjw/RCoT-Seg (GitHub)
