Video Object Segmentation with Language Referring Expressions
Anna Khoreva, Anna Rohrbach, and Bernt Schiele

TL;DR
This paper introduces a video object segmentation method that identifies the target object from a language referring expression instead of a pixel-accurate first-frame mask. A language description is cheaper and more natural to provide, can help avoid drift, and still yields performance competitive with mask-supervised methods.
Contribution
The paper extends language grounding models designed for images to video data, enforcing temporally coherent predictions so that an object described in language can be identified and segmented across the frames of a video.
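One way to picture what temporal coherence adds to a per-frame grounding model is the sketch below. It is an illustration only, not the authors' procedure: it assumes a hypothetical setup in which an image grounding model returns several candidate boxes per frame, each with a score for the referring expression, and it selects one box per frame by dynamic programming so that consecutive boxes overlap. The names `iou`, `coherent_track`, and the `smooth` weight are invented for this example.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def coherent_track(
    boxes: List[List[Box]],
    scores: List[List[float]],
    smooth: float = 1.0,
) -> List[int]:
    """Pick one candidate index per frame.

    Maximizes the sum of per-frame grounding scores plus `smooth` times
    the IoU between boxes chosen in consecutive frames (Viterbi-style
    dynamic programming). Both the objective and `smooth` are
    assumptions made for this sketch.
    """
    if not boxes:
        return []
    best = list(scores[0])          # best total ending at each candidate
    back: List[List[int]] = []      # back-pointers for frames 1..T-1
    for t in range(1, len(boxes)):
        cur, ptr = [], []
        for j, box_j in enumerate(boxes[t]):
            # Best predecessor in the previous frame for candidate j.
            k = max(
                range(len(boxes[t - 1])),
                key=lambda i: best[i] + smooth * iou(boxes[t - 1][i], box_j),
            )
            cur.append(best[k] + smooth * iou(boxes[t - 1][k], box_j) + scores[t][j])
            ptr.append(k)
        best = cur
        back.append(ptr)
    # Trace the best path backwards from the last frame.
    idx = max(range(len(best)), key=best.__getitem__)
    path = [idx]
    for ptr in reversed(back):
        idx = ptr[idx]
        path.append(idx)
    return path[::-1]


# Toy input: candidate 0 follows the target, candidate 1 is a static
# distractor that happens to score higher in the middle frame.
toy_boxes = [
    [(0, 0, 10, 10), (50, 50, 60, 60)],
    [(1, 1, 11, 11), (50, 50, 60, 60)],
    [(2, 2, 12, 12), (50, 50, 60, 60)],
]
toy_scores = [[0.9, 0.1], [0.4, 0.6], [0.9, 0.1]]

per_frame = coherent_track(toy_boxes, toy_scores, smooth=0.0)
coherent = coherent_track(toy_boxes, toy_scores, smooth=1.0)
```

With `smooth=0.0` the selection reduces to an independent per-frame choice and jumps to the distractor in the middle frame; with the overlap term enabled it stays on the same object throughout. How the paper itself enforces coherence is not specified in this summary.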
Findings
The language-supervised method performs on par with mask-based methods on DAVIS'16.
It is competitive with scribble-based methods on DAVIS'17.
The authors augment the DAVIS'16 and DAVIS'17 benchmarks with language descriptions for evaluation.
Abstract
Most state-of-the-art semi-supervised video object segmentation methods rely on a pixel-accurate mask of a target object provided for the first frame of a video. However, obtaining a detailed segmentation mask is expensive and time-consuming. In this work we explore an alternative way of identifying a target object, namely by employing language referring expressions. Besides being a more practical and natural way of pointing out a target object, using language specifications can help to avoid drift as well as make the system more robust to complex dynamics and appearance variations. Leveraging recent advances of language grounding models designed for images, we propose an approach to extend them to video data, ensuring temporally coherent predictions. To evaluate our method we augment the popular video object segmentation benchmarks, DAVIS'16 and DAVIS'17, with language descriptions of…
