The Second Place Solution for The 4th Large-scale Video Object Segmentation Challenge--Track 3: Referring Video Object Segmentation
Leilei Cao, Zhuang Li, Bo Yan, Feng Zhang, Fengliang Qi, Yuchen Hu, and Hongbin Wang

TL;DR
This paper presents a second-place solution for the RVOS challenge, enhancing the ReferFormer framework with training tricks and inference strategies to improve video object segmentation guided by language expressions.
Contribution
The work applies three training and inference techniques (cyclical learning rates, a semi-supervised approach, and test-time augmentation) to boost the performance of the ReferFormer model for RVOS.
Findings
Achieved second place in CVPR2022 Referring Youtube-VOS Challenge.
Implemented cyclical learning rates, semi-supervised training, and test-time augmentation.
Demonstrated improved segmentation accuracy over baseline models.
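As a rough illustration of the cyclical learning rate idea listed above, the sketch below implements the common triangular policy in plain Python. The summary does not state which cyclical policy, bounds, or cycle length the authors used, so the function name `triangular_lr` and the values `base_lr`, `max_lr`, and `step_size` are assumptions chosen only for illustration.

```python
import math


def triangular_lr(step, base_lr, max_lr, step_size):
    """Triangular cyclical learning rate (hypothetical helper).

    The rate climbs linearly from base_lr to max_lr over step_size steps,
    then falls back to base_lr over the next step_size steps, and repeats.
    """
    # Index of the current cycle, starting at 1; a cycle is 2 * step_size steps.
    cycle = math.floor(1 + step / (2 * step_size))
    # Distance from the peak of the current cycle, scaled to [0, 1].
    x = abs(step / step_size - 2 * cycle + 1)
    return base_lr + (max_lr - base_lr) * max(0.0, 1.0 - x)


# Illustrative values only; they are not taken from the paper.
schedule = [triangular_lr(s, base_lr=1e-5, max_lr=1e-4, step_size=100)
            for s in range(0, 401, 50)]
```

With these assumed values the rate starts at `base_lr`, peaks at step 100, returns to `base_lr` at step 200, and then repeats the same shape.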
Abstract
The referring video object segmentation (RVOS) task aims to segment, in all frames of a given video, the object instances referred to by a language expression. Because it requires understanding cross-modal semantics within individual instances, this task is more challenging than traditional semi-supervised video object segmentation, where the ground-truth object masks in the first frame are given. With the great success of Transformers in object detection and object segmentation, RVOS has made remarkable progress, with ReferFormer achieving state-of-the-art performance. In this work, building on ReferFormer as a strong baseline framework, we propose several tricks to boost its performance further, including cyclical learning rates, a semi-supervised approach, and test-time augmentation inference. The improved ReferFormer ranks 2nd on the CVPR2022 Referring Youtube-VOS Challenge.
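To make the test-time augmentation step more concrete, here is a minimal sketch of one common variant: predicting on a frame and on its horizontal mirror, then averaging the two probability maps. The abstract does not say which augmentations the authors used, so the choice of horizontal flipping, the helper names `hflip` and `predict_with_flip_tta`, and the toy `model` are all assumptions for illustration.

```python
def hflip(grid):
    """Mirror a 2D grid (list of rows) left to right."""
    return [row[::-1] for row in grid]


def predict_with_flip_tta(model, frame):
    """Average per-pixel probabilities over a frame and its mirror.

    `model` is a hypothetical callable that maps a 2D frame to a 2D map of
    foreground probabilities with the same shape.
    """
    probs = model(frame)
    # Predict on the mirrored frame, then mirror the result back so that
    # both maps are aligned pixel by pixel before averaging.
    probs_from_flip = hflip(model(hflip(frame)))
    return [[(a + b) / 2 for a, b in zip(row_a, row_b)]
            for row_a, row_b in zip(probs, probs_from_flip)]


def toy_model(frame):
    """Stand-in model whose output depends on the column position."""
    return [[v * (j + 1) / len(row) for j, v in enumerate(row)]
            for row in frame]


averaged = predict_with_flip_tta(toy_model, [[1.0, 1.0]])
```

One design consideration under this assumption: in a referring setting, mirroring the frame changes the meaning of expressions that mention "left" or "right", so such expressions would likely need to be excluded or rewritten before flipping.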
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Neural Network Applications · Visual Attention and Saliency Detection
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Residual Connection · Adam · Byte Pair Encoding · Layer Normalization · Absolute Position Encodings · Label Smoothing
