VIRST: Video-Instructed Reasoning Assistant for SpatioTemporal Segmentation

Jihwan Hong; Jaeyoung Do

arXiv:2603.27060 · cs.CV · March 31, 2026

TL;DR

VIRST is an end-to-end framework for Referring Video Object Segmentation (RVOS) that unifies global video reasoning and pixel-level mask prediction in a single model, targeting motion-intensive videos and queries that require multi-step reasoning.

Contribution

VIRST introduces a single end-to-end model that combines Spatio-Temporal Fusion (STF) with a Temporal Dynamic Anchor Updater to improve RVOS performance.

Findings

01. Achieves state-of-the-art results on RVOS benchmarks.

02. Effectively handles motion, occlusion, and reappearance in videos.

03. Demonstrates strong generalization to reasoning-oriented tasks.

Abstract

Referring Video Object Segmentation (RVOS) aims to segment target objects in videos based on natural language descriptions. However, fixed keyframe-based approaches that couple a vision language model with a separate propagation module often fail to capture rapidly changing spatiotemporal dynamics and to handle queries requiring multi-step reasoning, leading to sharp performance drops on motion-intensive and reasoning-oriented videos beyond static RVOS benchmarks. To address these limitations, we propose VIRST (Video-Instructed Reasoning Assistant for Spatio-Temporal Segmentation), an end-to-end framework that unifies global video reasoning and pixel-level mask prediction within a single model. VIRST bridges semantic and segmentation representations through the Spatio-Temporal Fusion (STF), which fuses segmentation-aware video features into the vision-language backbone, and employs the…

Code & Models

Repositories

AIDASLab/VIRST (GitHub)
