The Second Place Solution for The 4th Large-scale Video Object Segmentation Challenge--Track 3: Referring Video Object Segmentation

Leilei Cao; Zhuang Li; Bo Yan; Feng Zhang; Fengliang Qi; Yuchen Hu; Hongbin Wang

arXiv:2206.12035·cs.CV·June 27, 2022

TL;DR

This paper presents a second-place solution for the RVOS challenge, enhancing the ReferFormer framework with training tricks and inference strategies to improve video object segmentation guided by language expressions.

Contribution

The work adds three techniques to the ReferFormer model for RVOS: cyclical learning rates and a semi-supervised approach during training, and test-time augmentation at inference.

Findings

01

Achieved second place in CVPR2022 Referring Youtube-VOS Challenge.

02

Implemented cyclical learning rates, semi-supervised training, and test-time augmentation.

03

Demonstrated improved segmentation accuracy over the ReferFormer baseline.
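
Test-time augmentation, listed in the findings above, can be illustrated with a minimal sketch. The record does not say which augmentations the authors used, so horizontal flipping is an assumption here, and `toy_predict`, `hflip`, and `predict_with_flip_tta` are hypothetical names rather than part of ReferFormer. The idea shown is only the generic one: run the model on the frame and on a transformed copy, undo the transform on the second prediction, and average the per-pixel mask probabilities.

```python
def hflip(grid):
    """Mirror a 2D grid (list of rows) left to right."""
    return [row[::-1] for row in grid]


def toy_predict(frame):
    """Hypothetical stand-in for a segmentation model.

    It scales each pixel by a column-dependent weight, so its output is
    deliberately not symmetric under flipping (as real models are not).
    """
    width = len(frame[0])
    return [[pixel * (col + 1) / width for col, pixel in enumerate(row)]
            for row in frame]


def predict_with_flip_tta(predict, frame):
    """Average mask probabilities from the original and the mirrored frame."""
    probs = predict(frame)
    # Predict on the mirrored frame, then mirror the result back so that
    # both probability maps are aligned pixel by pixel.
    probs_from_flip = hflip(predict(hflip(frame)))
    return [[(a + b) / 2.0 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(probs, probs_from_flip)]


frame = [[0.0, 1.0]]
averaged = predict_with_flip_tta(toy_predict, frame)
```

One caveat specific to language-guided segmentation: an expression that mentions "left" or "right" no longer matches a mirrored frame, so a flip-based variant would need to handle such expressions separately.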

Abstract

The referring video object segmentation (RVOS) task aims to segment, in every frame of a given video, the object instances referred to by a language expression. Because it requires understanding cross-modal semantics within individual instances, this task is more challenging than traditional semi-supervised video object segmentation, where the ground-truth object masks in the first frame are given. With the great success of Transformers in object detection and object segmentation, RVOS has made remarkable progress, with ReferFormer achieving state-of-the-art performance. In this work, building on the strong ReferFormer baseline, we propose several tricks to boost its performance further, including cyclical learning rates, a semi-supervised approach, and test-time augmentation at inference. The improved ReferFormer ranks 2nd in the CVPR2022 Referring Youtube-VOS Challenge.
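
As an illustration of the cyclical learning rates mentioned in the abstract, the following is a minimal sketch of the common triangular policy, in which the rate rises linearly from a lower bound to an upper bound and then falls back, repeating every two half-cycles. The abstract does not give the schedule shape or its hyperparameters, so the triangular form and the names `base_lr`, `max_lr`, and `step_size` are assumptions chosen for illustration, not the authors' settings.

```python
import math


def triangular_lr(step, base_lr, max_lr, step_size):
    """Triangular cyclical learning rate (illustrative, not the paper's schedule).

    step      -- current training iteration, starting at 0
    base_lr   -- lower bound of the cycle
    max_lr    -- upper bound of the cycle
    step_size -- number of iterations in half a cycle
    """
    # Index of the current cycle, counting from 1.
    cycle = math.floor(1 + step / (2 * step_size))
    # Distance from the peak of the current cycle, scaled to [0, 1].
    x = abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# One full cycle with a half-cycle length of 4 iterations.
schedule = [triangular_lr(s, 0.0, 1.0, 4) for s in range(9)]
```

In practice a training script would more likely rely on a framework scheduler, for example `torch.optim.lr_scheduler.CyclicLR` in PyTorch, than on a hand-written function like this one.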

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Visual Attention and Saliency Detection

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Residual Connection · Adam · Byte Pair Encoding · Layer Normalization · Absolute Position Encodings · Label Smoothing