SeqFormer: Sequential Transformer for Video Instance Segmentation
Junfeng Wu, Yi Jiang, Song Bai, Wenqing Zhang, Xiang Bai

TL;DR
SeqFormer is a transformer-based method for video instance segmentation that aggregates temporal information across frames into a single video-level instance representation. It reaches state-of-the-art accuracy on YouTube-VIS without tracking branches or post-processing.
Contribution
It represents each instance with a single query shared across the whole video, while applying attention to each frame independently. This simplifies the pipeline and improves segmentation accuracy.
Findings
Achieves 47.4 AP on YouTube-VIS with a ResNet-50 backbone and 49.0 AP with a ResNet-101 backbone
Surpasses the previous state of the art by over 4 AP points
Integrating a Swin Transformer raises AP to 59.3
Abstract
In this work, we present SeqFormer for video instance segmentation. SeqFormer follows the principle of vision transformer that models instance relationships among video frames. Nevertheless, we observe that a stand-alone instance query suffices for capturing a time sequence of instances in a video, but attention mechanisms shall be done with each frame independently. To achieve this, SeqFormer locates an instance in each frame and aggregates temporal information to learn a powerful representation of a video-level instance, which is used to predict the mask sequences on each frame dynamically. Instance tracking is achieved naturally without tracking branches or post-processing. On YouTube-VIS, SeqFormer achieves 47.4 AP with a ResNet-50 backbone and 49.0 AP with a ResNet-101 backbone without bells and whistles. Such achievement significantly exceeds the previous state-of-the-art…
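The flow described in the abstract (one instance query, attention applied per frame, temporal aggregation, then dynamic mask prediction) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: plain dot-product attention stands in for whatever attention operator the paper uses, the temporal weights come from a hypothetical scoring vector `score_vec`, the mask head is reduced to a single dot product, and all features are random placeholders.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def weighted_sum(weights, vectors):
    """Return sum_i weights[i] * vectors[i] as a new vector."""
    dim = len(vectors[0])
    return [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]

def attend_one_frame(query, feats):
    """The instance query attends to the pixel features of ONE frame.
    Attention weights are normalized within this frame only."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, f) / scale for f in feats])
    return weighted_sum(weights, feats)

def aggregate_over_time(frame_embeds, score_vec):
    """Combine per-frame embeddings into one video-level embedding.
    score_vec is a hypothetical stand-in for learned temporal weights."""
    weights = softmax([dot(e, score_vec) for e in frame_embeds])
    return weighted_sum(weights, frame_embeds)

def predict_masks(video_embed, video_feats):
    """Use the video-level embedding as a dynamic kernel on every frame
    (assumed simplification: one dot product per pixel, thresholded at 0)."""
    return [[dot(video_embed, f) > 0 for f in feats] for feats in video_feats]

random.seed(0)
T, N, C = 4, 6, 8  # frames, pixels per frame, channels (toy sizes)
video_feats = [[[random.gauss(0, 1) for _ in range(C)] for _ in range(N)]
               for _ in range(T)]
instance_query = [random.gauss(0, 1) for _ in range(C)]
score_vec = [random.gauss(0, 1) for _ in range(C)]

# Step 1: the same query attends to each frame independently.
frame_embeds = [attend_one_frame(instance_query, feats) for feats in video_feats]
# Step 2: temporal aggregation into a single video-level instance embedding.
video_embed = aggregate_over_time(frame_embeds, score_vec)
# Step 3: one mask per frame from the same embedding.
masks = predict_masks(video_embed, video_feats)
print(len(frame_embeds), len(video_embed), len(masks), len(masks[0]))
```

Because every frame's mask is produced from the same video-level embedding, the masks belong to one instance by construction, which is the sense in which tracking needs no separate branch or post-processing step.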
Taxonomy
Topics: Video Analysis and Summarization · Human Pose and Action Recognition · Advanced Image and Video Retrieval Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dense Connections · Residual Connection · Layer Normalization · Vision Transformer
