Rethinking Cross-modal Interaction from a Top-down Perspective for Referring Video Object Segmentation

Chen Liang; Yu Wu; Tianfei Zhou; Wenguan Wang; Zongxin Yang; Yunchao Wei; Yi Yang

arXiv:2106.01061·cs.CV·January 22, 2024·32 cites

TL;DR

This paper introduces a two-stage, top-down approach to referring video object segmentation: it first constructs object tracklets, then grounds the language expression over them with a Transformer-based module. The model ranked first in the CVPR2021 Referring Youtube-VOS challenge.

Contribution

The paper proposes a two-stage, top-down method that pairs object tracklets with a Transformer-based tracklet-language grounding module, exploiting object-level cues that bottom-up strategies miss.

Findings

01

Achieved first place in the CVPR2021 Referring Youtube-VOS challenge.

02

Outperformed previous bottom-up methods in RVOS tasks.

03

Demonstrated effectiveness of object-level cues and Transformer-based modeling.

Abstract

Referring video object segmentation (RVOS) aims to segment video objects under the guidance of a natural language reference. Previous methods typically tackle RVOS by directly grounding the linguistic reference over the image lattice. Such a bottom-up strategy fails to explore object-level cues, which easily leads to inferior results. In this work, we instead put forward a two-stage, top-down RVOS solution. First, an exhaustive set of object tracklets is constructed by propagating object masks detected in several sampled frames to the entire video. Second, a Transformer-based tracklet-language grounding module is proposed, which models instance-level visual relations and cross-modal interactions simultaneously and efficiently. Our model ranks first in the CVPR2021 Referring Youtube-VOS challenge.
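The second stage can be pictured as choosing, among candidate tracklets, the one that best matches the expression. The snippet below is only a toy sketch of that selection step, not the paper's Transformer-based grounding module: it swaps in plain cosine similarity followed by a softmax. The function names (`cosine`, `softmax`, `ground_expression`) and the three-dimensional feature vectors are hypothetical, chosen purely for illustration; a real system would take tracklet features from a video backbone and the text feature from a language encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def ground_expression(tracklet_feats, text_feat):
    """Score every candidate tracklet against the text feature.

    Assumption: one pooled feature vector per tracklet and one for the
    whole expression. Returns (index of best tracklet, probabilities).
    """
    scores = [cosine(f, text_feat) for f in tracklet_feats]
    probs = softmax(scores)
    best = max(range(len(probs)), key=probs.__getitem__)
    return best, probs


# Hypothetical features for three candidate tracklets and one expression.
tracklets = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.6, 0.8, 0.0]]
text = [0.0, 1.0, 0.0]
best, probs = ground_expression(tracklets, text)
```

Scoring whole tracklets rather than individual pixels is what makes the approach top-down: the language is matched against object-level candidates. Unlike this sketch, the module described in the abstract also models relations between the tracklets themselves.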

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications