Online Reasoning Video Object Segmentation
Jinyuan Liu, Yang Wang, Zeyu Zhao, Weixin Li, Song Wang, Ruize Han

TL;DR
This paper introduces ORVOS, the task of online, causal video object segmentation from natural-language queries, together with a benchmark (ORVOSB) and a baseline model. The emphasis is on frame-by-frame reasoning without access to future frames and on handling referent shifts.
Contribution
The paper presents the ORVOSB benchmark, which provides frame-level causal annotations and referent-shift labels, and proposes a baseline model with continual prompt updates and temporal reasoning capabilities.
Findings
Existing methods perform poorly under causal constraints.
The proposed baseline improves long-horizon reasoning.
ORVOSB enables evaluation of online reasoning in video segmentation.
Abstract
Reasoning video object segmentation predicts pixel-level masks in videos from natural-language queries that may involve implicit and temporally grounded references. However, existing methods are developed and evaluated in an offline regime, where the entire video is available at inference time and future frames can be exploited for retrospective disambiguation, deviating from real-world deployments that require strictly causal, frame-by-frame decisions. We study Online Reasoning Video Object Segmentation (ORVOS), where models must incrementally interpret queries using only past and current frames without revisiting previous predictions, while handling referent shifts as events unfold. To support evaluation, we introduce ORVOSB, a benchmark with frame-level causal annotations and referent-shift labels, comprising 210 videos, 12,907 annotated frames, and 512 queries across five reasoning…
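The causal constraint described in the abstract can be illustrated with a minimal sketch. This is not the paper's baseline: `run_online`, `toy_step`, and the running-mean threshold are hypothetical stand-ins, chosen only to show a loop that sees one frame at a time, carries state forward, and never revises a mask once it has been emitted.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Frame = List[List[float]]  # toy grayscale frame: rows of pixel intensities
Mask = List[List[int]]     # binary mask with the same shape as the frame
State = Tuple[float, int]  # (sum of all pixels seen so far, pixel count)


def run_online(
    frames: Iterable[Frame],
    query: str,
    step: Callable[[Frame, str, State], Tuple[Mask, State]],
    init_state: State,
) -> Iterator[Mask]:
    """Yield one mask per frame using only past and current frames."""
    state = init_state
    for frame in frames:
        # The step function receives the current frame and the carried
        # state only; future frames are never visible to it.
        mask, state = step(frame, query, state)
        yield mask  # emitted immediately and never revised afterwards


def toy_step(frame: Frame, query: str, state: State) -> Tuple[Mask, State]:
    """Hypothetical stand-in model: flag pixels brighter than the running mean.

    The query is accepted but ignored here; a real model would condition on it.
    """
    total, count = state
    for row in frame:
        total += sum(row)
        count += len(row)
    mean = total / count
    mask = [[1 if px > mean else 0 for px in row] for row in frame]
    return mask, (total, count)


# Three tiny 1x2 frames standing in for a video.
video = [[[0.0, 1.0]], [[0.5, 0.9]], [[0.2, 0.1]]]
full = list(run_online(video, "the brighter object", toy_step, (0.0, 0)))
prefix = list(run_online(video[:2], "the brighter object", toy_step, (0.0, 0)))

# Causality check: predictions for the first two frames do not depend on
# whether a third frame ever arrives.
assert full[:2] == prefix
```

The final assertion is the property an online evaluation relies on: the masks for the first frames are the same whether or not later frames exist. An offline method that uses future frames for retrospective disambiguation would generally fail such a check.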
