SeqFormer: Sequential Transformer for Video Instance Segmentation

Junfeng Wu; Yi Jiang; Song Bai; Wenqing Zhang; Xiang Bai

arXiv:2112.08275 · cs.CV · July 22, 2022 · 6 cites

TL;DR

SeqFormer is a transformer-based method for video instance segmentation that keeps a single instance query per object, attends to each frame independently, and aggregates the results over time. It reaches state-of-the-art accuracy without tracking branches or post-processing.

Contribution

It proposes sharing a single instance query across the whole video while performing attention on each frame independently, which simplifies the pipeline and improves video instance segmentation accuracy.

Findings

01 Achieves 47.4 AP on YouTube-VIS with a ResNet-50 backbone

02 Surpasses the previous state of the art by over 4 AP points

03 Reaches 59.3 AP with a Swin Transformer backbone

Abstract

In this work, we present SeqFormer for video instance segmentation. SeqFormer follows the principle of vision transformer that models instance relationships among video frames. Nevertheless, we observe that a stand-alone instance query suffices for capturing a time sequence of instances in a video, but attention mechanisms shall be done with each frame independently. To achieve this, SeqFormer locates an instance in each frame and aggregates temporal information to learn a powerful representation of a video-level instance, which is used to predict the mask sequences on each frame dynamically. Instance tracking is achieved naturally without tracking branches or post-processing. On YouTube-VIS, SeqFormer achieves 47.4 AP with a ResNet-50 backbone and 49.0 AP with a ResNet-101 backbone without bells and whistles. Such achievement significantly exceeds the previous state-of-the-art…
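The data flow the abstract describes (one shared instance query, attention computed within each frame separately, temporal aggregation into a video-level embedding, then per-frame mask prediction) can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: plain scaled dot-product attention stands in for whatever attention the model actually uses, the features are random, and every name here (`per_frame_attention`, `aggregate_over_time`, `dynamic_masks`, `score_vec`) is hypothetical.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def weighted_sum(weights, vectors):
    """Sum of vectors scaled by their weights."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]


def per_frame_attention(query, frames):
    """Attend from one shared query to each frame separately.

    frames is a list of T frames, each a list of N token vectors of size d.
    The softmax is taken within a frame only, so frames never mix here.
    Returns T frame-level embeddings of size d.
    """
    scale = math.sqrt(len(query))
    frame_embs = []
    for tokens in frames:
        weights = softmax([dot(query, tok) / scale for tok in tokens])
        frame_embs.append(weighted_sum(weights, tokens))
    return frame_embs


def aggregate_over_time(frame_embs, score_vec):
    """Combine frame-level embeddings into one video-level embedding.

    score_vec is an assumed learned scoring vector; the softmax runs
    across frames, so the weights sum to 1 over time.
    """
    frame_weights = softmax([dot(e, score_vec) for e in frame_embs])
    return weighted_sum(frame_weights, frame_embs)


def dynamic_masks(video_emb, frames):
    """Apply the video-level embedding as a filter on every frame's tokens."""
    return [
        [1.0 / (1.0 + math.exp(-dot(video_emb, tok))) for tok in tokens]
        for tokens in frames
    ]


random.seed(0)
T, N, d = 4, 16, 8  # frames, tokens per frame, feature size (toy values)
query = [random.gauss(0, 1) for _ in range(d)]
frames = [[[random.gauss(0, 1) for _ in range(d)] for _ in range(N)] for _ in range(T)]
score_vec = [random.gauss(0, 1) for _ in range(d)]

frame_embs = per_frame_attention(query, frames)
video_emb = aggregate_over_time(frame_embs, score_vec)
masks = dynamic_masks(video_emb, frames)
print(len(frame_embs), len(video_emb), len(masks), len(masks[0]))
```

Because the same video-level embedding produces the mask on every frame, the masks in the sketch are associated with one instance across time by construction, which mirrors the abstract's claim that tracking needs no separate branch or post-processing.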

Taxonomy

Topics: Video Analysis and Summarization · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Residual Connection · Layer Normalization · Vision Transformer