TL;DR
VIRST is an end-to-end video reasoning framework that improves spatio-temporal segmentation by unifying global reasoning and pixel prediction, handling dynamic motion and complex queries effectively.
Contribution
VIRST introduces a single unified model that combines Spatio-Temporal Fusion (STF) with a Temporal Dynamic Anchor Updater to improve RVOS performance.
Findings
Achieves state-of-the-art results on RVOS benchmarks.
Effectively handles motion, occlusion, and reappearance in videos.
Demonstrates strong generalization to reasoning-oriented tasks.
Abstract
Referring Video Object Segmentation (RVOS) aims to segment target objects in videos based on natural language descriptions. However, fixed keyframe-based approaches that couple a vision-language model with a separate propagation module often fail to capture rapidly changing spatio-temporal dynamics and to handle queries requiring multi-step reasoning, leading to sharp performance drops on motion-intensive and reasoning-oriented videos beyond static RVOS benchmarks. To address these limitations, we propose VIRST (Video-Instructed Reasoning Assistant for Spatio-Temporal Segmentation), an end-to-end framework that unifies global video reasoning and pixel-level mask prediction within a single model. VIRST bridges semantic and segmentation representations through Spatio-Temporal Fusion (STF), which fuses segmentation-aware video features into the vision-language backbone, and employs the…
