InterRVOS: Interaction-aware Referring Video Object Segmentation
Woojeong Jin, Seongchan Kim, Jaeho Lee, Seungryong Kim

TL;DR
This paper introduces InterRVOS, a new task and dataset for interaction-aware referring video object segmentation, in which the actor and target objects of an interaction described in natural language are segmented separately, and proposes an architecture for this task.
Contribution
The paper proposes a new interaction-aware RVOS task, creates a large-scale dataset with interaction annotations, and develops a specialized model that improves role-specific segmentation performance.
Findings
The proposed model, ReVIOSa, outperforms existing baselines on the new dataset.
The dataset contains over 127K annotated expressions with distinct actor and target masks.
The model achieves strong results on standard RVOS benchmarks.
Abstract
Referring video object segmentation (RVOS) aims to segment objects in a video described by a natural language expression. However, most existing approaches focus on segmenting only the referred object (typically the actor), even when the expression clearly describes an interaction involving multiple objects with distinct roles. For instance, "A throwing B" implies a directional interaction, but standard RVOS segments only the actor (A), neglecting other involved target objects (B). In this paper, we introduce Interaction-aware Referring Video Object Segmentation (InterRVOS), a novel task that focuses on the modeling of interactions. It requires the model to segment the actor and target objects separately, reflecting their asymmetric roles in an interaction. This task formulation enables fine-grained understanding of object relationships, as many video events are defined by such…
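To make the task formulation concrete, here is a minimal sketch of what a role-separated prediction and a role-wise score could look like. Everything in it is an illustrative assumption rather than the paper's code or evaluation protocol: the `RoleMasks` container, the `frame_iou` and `role_iou` helpers, and the use of mean per-frame IoU as the score are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List

# One binary mask per frame, stored as rows of 0/1 values (assumed layout).
Mask = List[List[int]]


@dataclass
class RoleMasks:
    """Hypothetical container: one mask sequence per interaction role."""
    actor: List[Mask]   # e.g. "A" in "A throwing B"
    target: List[Mask]  # e.g. "B" in "A throwing B"


def frame_iou(pred: Mask, gt: Mask) -> float:
    """Intersection over union of two binary masks of equal size."""
    inter = sum(p & g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    union = sum(p | g for pr, gr in zip(pred, gt) for p, g in zip(pr, gr))
    # Two empty masks agree perfectly; avoid dividing by zero.
    return 1.0 if union == 0 else inter / union


def role_iou(pred: RoleMasks, gt: RoleMasks) -> Dict[str, float]:
    """Mean per-frame IoU, computed separately for actor and target."""
    def mean_iou(p: List[Mask], g: List[Mask]) -> float:
        return sum(frame_iou(a, b) for a, b in zip(p, g)) / len(g)

    return {
        "actor": mean_iou(pred.actor, gt.actor),
        "target": mean_iou(pred.target, gt.target),
    }


# Toy one-frame example on a 2x3 grid for the expression "A throwing B".
gt = RoleMasks(
    actor=[[[1, 1, 0], [0, 0, 0]]],
    target=[[[0, 0, 0], [0, 0, 1]]],
)
pred = RoleMasks(
    actor=[[[1, 0, 0], [0, 0, 0]]],   # covers half of the actor
    target=[[[0, 0, 0], [0, 0, 1]]],  # matches the target exactly
)
scores = role_iou(pred, gt)

# Swapping the two roles finds the same pixels but assigns them wrongly.
swapped = RoleMasks(actor=gt.target, target=gt.actor)
swapped_scores = role_iou(swapped, gt)
```

Scoring the two roles separately is what distinguishes this setup from single-mask RVOS: in the toy example, the swapped prediction covers exactly the right pixels overall yet scores zero for both roles, which reflects the asymmetry the task is meant to capture.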
Taxonomy
Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Video Analysis and Summarization
Methods: Focus
