Learning Cross-Modal Affinity for Referring Video Object Segmentation Targeting Limited Samples

Guanghui Li; Mingqi Gao; Heng Liu; Xiantong Zhen; Feng Zheng

arXiv:2309.02041·cs.CV·September 6, 2023

TL;DR

This paper introduces a novel cross-modal affinity Transformer model for few-shot referring video object segmentation, enabling effective adaptation to new scenes with minimal annotations and establishing a new benchmark for the task.

Contribution

The paper proposes a cross-modal affinity module within a Transformer architecture for few-shot RVOS and creates a new benchmark to evaluate such methods.

Findings

1. Achieves state-of-the-art performance on the new FS-RVOS benchmark.

2. Outperforms baselines by 10% on Mini-Ref-YouTube-VOS.

3. Achieves significantly better results on Mini-Ref-SAIL-VOS.

Abstract

Referring video object segmentation (RVOS), as a supervised learning task, relies on sufficient annotated data for a given scene. However, in more realistic scenarios, only minimal annotations are available for a new scene, which poses significant challenges to existing RVOS methods. With this in mind, we propose a simple yet effective model with a newly designed cross-modal affinity (CMA) module based on a Transformer architecture. The CMA module builds multimodal affinity from a few samples, thus quickly learning new semantic information and enabling the model to adapt to different scenarios. Since the proposed method targets limited samples for new scenes, we generalize the problem as few-shot referring video object segmentation (FS-RVOS). To foster research in this direction, we build a new FS-RVOS benchmark based on currently available datasets. The benchmark covers a wide…
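
The abstract does not spell out how the CMA module computes affinity. As a rough illustration only, the sketch below assumes the affinity is a scaled dot-product cross-attention map between features of the query video and features of the few annotated support samples, which is a common way to relate two feature sets in a Transformer. The function name `cross_affinity`, the feature shapes, and the random inputs are all hypothetical and are not taken from the paper or its repository.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_affinity(query_feats, support_feats):
    """Hypothetical affinity step (an assumption, not the paper's CMA module).

    query_feats:   (Nq, d) features from the video to be segmented
    support_feats: (Ns, d) features from the few annotated support samples
    Returns the support information aggregated per query position, shape
    (Nq, d), and the row-normalised affinity map, shape (Nq, Ns).
    """
    d = query_feats.shape[-1]
    # Similarity of every query position to every support position.
    scores = query_feats @ support_feats.T / np.sqrt(d)
    # Each row becomes a distribution over support positions.
    affinity = softmax(scores, axis=-1)
    # Weighted sum of support features for each query position.
    aggregated = affinity @ support_feats
    return aggregated, affinity


# Toy inputs: 6 query positions, 4 support positions, 8-dim features.
rng = np.random.default_rng(0)
query = rng.standard_normal((6, 8))
support = rng.standard_normal((4, 8))
aggregated, affinity = cross_affinity(query, support)
print(aggregated.shape, affinity.shape)
```

In this sketch each query position receives a weighted mixture of support features, which is one plausible way a model could transfer semantics from a handful of annotated samples to a new scene. The official implementation is in the repository listed under Code & Models.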

Code & Models

Repositories

hengliusky/few_shot_rvos (PyTorch, official)

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Advanced Neural Network Applications

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Residual Connection · Adam · Byte Pair Encoding · Softmax · Dropout · Label Smoothing · Absolute Position Encodings