Isomer: Isomerous Transformer for Zero-shot Video Object Segmentation

Yichen Yuan; Yifan Wang; Lijun Wang; Xiaoqi Zhao; Huchuan Lu; Yu Wang; Weibo Su; Lei Zhang

arXiv:2308.06693 · cs.CV · August 15, 2023 · 1 citation

TL;DR

Isomer introduces a novel Transformer-based framework for zero-shot video object segmentation that improves efficiency and accuracy by employing stage-specific Transformer variants, achieving state-of-the-art results.

Contribution

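The paper proposes a level-isomerous Transformer framework that uses a different Transformer variant at different feature stages: the Context-Sharing Transformer (CST) for low-level stages and the Semantic Gathering-Scattering Transformer (SGST) for high-level stages. This improves both speed and accuracy in zero-shot video object segmentation (ZVOS).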

Findings

01

Runs 13 times faster than a baseline that uses vanilla Transformers for feature fusion.

02

Achieved new state-of-the-art ZVOS performance.

03

Efficient modeling of stage-specific dependencies.

Abstract

Recent leading zero-shot video object segmentation (ZVOS) works are devoted to integrating appearance and motion information by elaborately designing feature fusion modules and applying them identically in multiple feature stages. Our preliminary experiments show that, with the strong long-range dependency modeling capacity of the Transformer, simply concatenating the two modality features and feeding them to vanilla Transformers for feature fusion can distinctly benefit performance, but at the cost of heavy computation. Through further empirical analysis, we find that the attention dependencies learned by Transformers at different stages exhibit completely different properties: global query-independent dependency in the low-level stages and semantic-specific dependency in the high-level stages. Motivated by these observations, we propose two Transformer variants: i) Context-Sharing Transformer (CST)…
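The cost argument in the abstract can be illustrated with a toy example. The sketch below is not the paper's CST or its released code; it only shows, under stated assumptions, why a query-independent dependency makes per-query attention redundant. The names `full_attention` and `shared_context`, the tiny 2-dimensional tokens, and the choice to concatenate appearance and motion features along the token axis are all hypothetical and chosen for illustration.

```python
import math


def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, values):
    # Combine value vectors using one row of attention weights.
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]


def full_attention(queries, keys, values):
    # Vanilla attention: one softmax row per query, so N x N scores in total.
    return [weighted_sum(softmax([dot(q, k) for k in keys]), values) for q in queries]


def shared_context(global_query, keys, values):
    # Query-independent attention: a single softmax row (N scores)
    # whose result is shared by every position.
    return weighted_sum(softmax([dot(global_query, k) for k in keys]), values)


# Assumption: the two modalities are concatenated along the token axis.
appearance_tokens = [[1.0, 0.0], [0.0, 1.0]]
motion_tokens = [[0.5, 0.5], [1.0, 1.0]]
tokens = appearance_tokens + motion_tokens

full_score_count = len(tokens) ** 2   # 16 scores for 4 tokens
shared_score_count = len(tokens)      # 4 scores for 4 tokens

# If every position effectively uses the same query, each row of full
# attention equals the single shared context vector.
global_query = [0.3, -0.2]
per_query = full_attention([global_query] * len(tokens), tokens, tokens)
shared = shared_context(global_query, tokens, tokens)

print(full_score_count, shared_score_count)
print(all(row == shared for row in per_query))
```

In this toy setting the number of attention scores grows quadratically with the token count for full attention and linearly for the shared context, which is one way to read the abstract's observation that low-level stages do not need a separate attention map per query.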

Code & Models

Repositories

dlut-yyc/isomer (PyTorch, official)

Models

divenyuan/Isomer (Hugging Face model)

Taxonomy

Topics: Visual Attention and Saliency Detection · Advanced Neural Network Applications · Domain Adaptation and Few-Shot Learning

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Layer Normalization · Adam · Softmax · Label Smoothing · Position-Wise Feed-Forward Layer · Residual Connection