Deeply Interleaved Two-Stream Encoder for Referring Video Segmentation

Guang Feng; Lihe Zhang; Zhiwei Hu; Huchuan Lu

arXiv:2203.15969 · cs.CV · March 31, 2022 · 1 citation

TL;DR

This paper introduces a deeply interleaved two-stream encoder with a vision-language mutual guidance (VLMG) module and a language-guided multi-scale dynamic filtering (LMDF) module, improving referring video segmentation through stronger multi-modal fusion and temporal coherence.

Contribution

It proposes a new hierarchical two-stream encoder with mutual guidance and dynamic filtering modules for enhanced multi-modal fusion and temporal alignment in referring video segmentation.

Findings

01 Outperforms existing methods on four benchmark datasets.

02 Demonstrates improved multi-modal feature fusion and temporal coherence.

03 Achieves state-of-the-art segmentation accuracy.

Abstract

Referring video segmentation aims to segment the video object described by a language expression. To address this task, we first design a two-stream encoder that hierarchically extracts CNN-based visual features and transformer-based linguistic features; a vision-language mutual guidance (VLMG) module is inserted into the encoder multiple times to promote hierarchical and progressive fusion of the multi-modal features. Compared with existing multi-modal fusion methods, this two-stream encoder takes the multi-granularity linguistic context into account and realizes deep interleaving between modalities with the help of VLMG. To promote temporal alignment between frames, we further propose a language-guided multi-scale dynamic filtering (LMDF) module to strengthen temporal coherence, which uses the language-guided spatial-temporal features to…
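
The interleaving idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function names (attend, mutual_guidance, interleaved_encoder) are hypothetical, the guidance step is plain dot-product attention with a residual add, the encoder stages are left as identity stand-ins, and both streams are assumed to share one feature dimension. It only shows the control flow, where each stream is refreshed with context from the other at every stage.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(queries, keys):
    """For each query vector, return the attention-weighted average of keys."""
    dim = len(keys[0])
    scale = math.sqrt(dim)
    contexts = []
    for q in queries:
        weights = softmax([dot(q, k) / scale for k in keys])
        contexts.append(
            [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
        )
    return contexts


def mutual_guidance(visual, words):
    """Hypothetical guidance step: each stream adds context from the other."""
    lang_ctx = attend(visual, words)  # language context per spatial position
    vis_ctx = attend(words, visual)   # visual context per word
    new_visual = [[v + c for v, c in zip(vec, ctx)]
                  for vec, ctx in zip(visual, lang_ctx)]
    new_words = [[w + c for w, c in zip(vec, ctx)]
                 for vec, ctx in zip(words, vis_ctx)]
    return new_visual, new_words


def interleaved_encoder(visual, words, num_stages=3):
    """Apply the guidance step once per stage (assumed stage count)."""
    for _ in range(num_stages):
        # A real encoder would run a CNN stage on `visual` and a transformer
        # layer on `words` here; this sketch uses the features unchanged.
        visual, words = mutual_guidance(visual, words)
    return visual, words


if __name__ == "__main__":
    frame_features = [[0.0, 0.0], [2.0, 3.0], [1.0, -1.0], [0.5, 0.5]]
    word_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    fused_visual, fused_words = interleaved_encoder(frame_features, word_features)
    print(len(fused_visual), len(fused_words))
```

In this sketch the visual stream keeps one vector per spatial position and the language stream keeps one vector per word, so the fused outputs have the same shapes as the inputs and could be passed on to a later stage or decoder.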

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning