InterRVOS: Interaction-aware Referring Video Object Segmentation

Woojeong Jin; Seongchan Kim; Jaeho Lee; Seungryong Kim

arXiv:2506.02356·cs.CV·August 19, 2025

Open Access · 2 Models · 1 Dataset

TL;DR

This paper introduces InterRVOS, a task and dataset for interaction-aware referring video object segmentation, in which the actor and target objects named in a natural-language expression are segmented separately. It also proposes an architecture for this task.

Contribution

The paper proposes a new interaction-aware RVOS task, creates a large-scale dataset with interaction annotations, and develops a specialized model that improves role-specific segmentation performance.

Findings

01 ReVIOSa outperforms existing baselines on the new dataset.

02 The dataset contains over 127K annotated expressions with distinct actor and target masks.

03 The model achieves strong results on standard RVOS benchmarks.

Abstract

Referring video object segmentation (RVOS) aims to segment objects in a video described by a natural language expression. However, most existing approaches focus on segmenting only the referred object (typically the actor), even when the expression clearly describes an interaction involving multiple objects with distinct roles. For instance, "A throwing B" implies a directional interaction, but standard RVOS segments only the actor (A), neglecting other involved target objects (B). In this paper, we introduce Interaction-aware Referring Video Object Segmentation (InterRVOS), a novel task that focuses on the modeling of interactions. It requires the model to segment the actor and target objects separately, reflecting their asymmetric roles in an interaction. This task formulation enables fine-grained understanding of object relationships, as many video events are defined by such…
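The task formulation above (separate masks for the actor and the target of one expression) can be illustrated with a minimal sketch. The `RoleMasks` container, the `mask_iou` and `role_scores` helpers, and the use of mean per-frame IoU are hypothetical choices made for illustration; the source does not specify the data layout or the evaluation metric.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class RoleMasks:
    """Hypothetical container: one boolean mask track per role."""
    actor: np.ndarray   # shape (T, H, W), True where the actor is
    target: np.ndarray  # shape (T, H, W), True where the target is


def mask_iou(pred: np.ndarray, gt: np.ndarray) -> float:
    """Mean per-frame IoU; a frame where both masks are empty scores 1.0."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    inter = np.logical_and(pred, gt).sum(axis=(1, 2))
    union = np.logical_or(pred, gt).sum(axis=(1, 2))
    # np.maximum avoids division by zero; np.where overrides those frames.
    iou = np.where(union == 0, 1.0, inter / np.maximum(union, 1))
    return float(iou.mean())


def role_scores(pred: RoleMasks, gt: RoleMasks) -> dict:
    """Score each role separately (an assumed, illustrative metric)."""
    return {
        "actor": mask_iou(pred.actor, gt.actor),
        "target": mask_iou(pred.target, gt.target),
    }


# Toy clip for "A throwing B": A fills the left half, B the right half.
T, H, W = 2, 4, 4
actor = np.zeros((T, H, W), dtype=bool)
actor[:, :, :2] = True
target = np.zeros((T, H, W), dtype=bool)
target[:, :, 2:] = True
gt = RoleMasks(actor=actor, target=target)

# A standard RVOS output that segments only the actor.
actor_only = RoleMasks(actor=actor, target=np.zeros_like(target))
# An output that finds both objects but swaps their roles.
swapped = RoleMasks(actor=target, target=actor)

actor_only_scores = role_scores(actor_only, gt)
swapped_scores = role_scores(swapped, gt)
```

In this toy case the actor-only output scores 1.0 on the actor and 0.0 on the target, and the role-swapped output scores 0.0 on both. Scoring the roles separately is what makes those two failures visible.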

Code & Models

Models

Datasets

wooj0216/InterRVOS-127K
dataset· 27 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Visual Attention and Saliency Detection · Video Analysis and Summarization

Methods: Focus