TL;DR
RCoT-Seg is a video reasoning segmentation framework that explicitly separates temporal reasoning from spatial perception, improving accuracy and robustness in complex multi-object video scenes.
Contribution
It proposes a reinforced chain-of-thought approach with a keyframe selection module and a two-stage segmentation process, enhancing holistic temporal understanding and spatial precision.
Findings
Achieves state-of-the-art performance on video reasoning and segmentation benchmarks.
Improves moment localization and inter-frame mask consistency.
Separating temporal reasoning (TVR) from spatial perception (KTP) contributes to these gains.
Abstract
Video Reasoning Segmentation (VRS) aims to segment target objects in videos based on implicit instructions that convey human intent and temporal logic. Existing MLLM-based methods select frames via simple sampling or an auxiliary MLLM and then predict masks with a [SEG] token. Because these selectors rely on limited supervision and frame-language similarity rules, they often make narrow-scope keyframe choices, which weakens holistic temporal understanding and leads to brittle localization in complex multi-object scenes. To address these issues, we introduce RCoT-Seg, a video-of-thought framework that factorizes VRS into temporal video reasoning (TVR) and keyframe target perception (KTP), explicitly separating temporal reasoning from spatial perception. Specifically, in the TVR stage, an agentic keyframe selection module, initialized with a curated CoT-start corpus and refined by GRPO under task-aligned rewards, is…
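The abstract says the keyframe selector is refined by GRPO under task-aligned rewards but, as excerpted here, does not spell out the reward terms. As a minimal sketch of the general idea only, GRPO is usually described as scoring each sampled rollout relative to the other rollouts drawn for the same query, by normalizing rewards within that group. The function name, the `eps` constant, and the reward values below are illustrative assumptions, not the paper's implementation.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and std of its own group.

    `rewards` holds one scalar reward per rollout sampled for the same
    query. `eps` (an assumed constant) avoids division by zero when all
    rollouts in the group receive the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four keyframe-selection rollouts on one query.
rewards = [0.2, 0.8, 0.5, 0.5]
advantages = group_relative_advantages(rewards)
```

Rollouts that score above the group mean get a positive advantage and are reinforced; those below get a negative one. No separate value network is needed for this step, since the group itself serves as the baseline.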
