RefVOS: A Closer Look at Referring Expressions for Video Object Segmentation
Miriam Bellver, Carles Ventura, Carina Silberer, Ioannis Kazakos, Jordi Torres, Xavier Giro-i-Nieto

TL;DR
This paper introduces RefVOS, a neural network for language-guided video object segmentation, highlighting the importance of non-trivial referring expressions and analyzing challenges related to motion and actions.
Contribution
The work provides a new categorization of referring expressions and demonstrates RefVOS's effectiveness, revealing key challenges in understanding motion and static actions.
Findings
RefVOS achieves state-of-the-art results in language-guided VOS.
Non-trivial referring expressions (REs) are more challenging and reveal limitations in current models.
The major challenges for the task are understanding motion and static actions.
Abstract
Given a linguistic phrase and a video, the task of video object segmentation with referring expressions (language-guided VOS) is to generate binary masks for the object to which the phrase refers. Our work argues that existing benchmarks used for this task are mainly composed of trivial cases, in which referents can be identified with simple phrases. Our analysis relies on a new categorization of the phrases in the DAVIS-2017 and Actor-Action datasets into trivial and non-trivial REs, with the non-trivial REs annotated with seven RE semantic categories. We leverage this data to analyze the results of RefVOS, a novel neural network that obtains competitive results for the task of language-guided image segmentation and state-of-the-art results for language-guided VOS. Our study indicates that the major challenges for the task are related to understanding motion and static actions.
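The input/output contract described in the abstract, and the kind of per-category breakdown the analysis relies on, can be illustrated with a minimal sketch. It is not the RefVOS implementation. It assumes frames are scored independently, uses a placeholder scorer in place of a real network, and uses intersection-over-union (IoU) as the quality measure; the summary above does not state which metric the paper reports. All names (`segment_video`, `score_fn`, `mean_iou_by_category`) are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Frame = List[List[float]]  # hypothetical: one frame as an H x W grid of values
Mask = List[List[bool]]    # binary mask with the same H x W shape


def segment_video(
    frames: List[Frame],
    phrase: str,
    score_fn: Callable[[Frame, str], List[List[float]]],
    threshold: float = 0.5,
) -> List[Mask]:
    """Return one binary mask per frame for the object the phrase refers to.

    score_fn stands in for a language-conditioned model: it maps a frame and
    a phrase to per-pixel scores, which are thresholded into a mask.
    """
    masks = []
    for frame in frames:
        scores = score_fn(frame, phrase)
        masks.append([[s > threshold for s in row] for row in scores])
    return masks


def iou(pred: Mask, target: Mask) -> float:
    """Intersection-over-union of two binary masks of equal shape."""
    inter = 0
    union = 0
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            inter += int(p and t)
            union += int(p or t)
    # Two empty masks agree perfectly; avoid dividing by zero.
    return inter / union if union else 1.0


def mean_iou_by_category(results: List[Tuple[str, float]]) -> Dict[str, float]:
    """Average IoU per RE category, e.g. 'trivial' vs. 'non-trivial'."""
    grouped = defaultdict(list)
    for category, value in results:
        grouped[category].append(value)
    return {category: sum(vals) / len(vals) for category, vals in grouped.items()}


def placeholder_scorer(frame: Frame, phrase: str) -> List[List[float]]:
    """Placeholder only: ignores the phrase and returns the pixel values."""
    return frame
```

Grouping scores by RE category is what exposes a gap that a single dataset-wide average would hide: if most phrases are trivial, the overall mean is dominated by them.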
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Pose and Action Recognition · Video Analysis and Summarization
Methods: Linear Layer · VOS · 1x1 Convolution · Convolution · Residual Connection · Weight Decay · Attention Dropout · Linear Warmup With Linear Decay · WordPiece · Adam
