End-to-End Video Instance Segmentation with Transformers
Yuqing Wang, Zhaoliang Xu, Xinlong Wang, Chunhua Shen, Baoshan Cheng, Hao Shen, Huaxia Xia

TL;DR
VisTR introduces an end-to-end Transformer-based framework for video instance segmentation, simplifying the pipeline and achieving state-of-the-art speed and competitive accuracy on the YouTube-VIS dataset.
Contribution
The paper presents the first simple, fast, end-to-end Transformer-based approach for VIS, framing segmentation and tracking as a single sequence prediction problem.
Findings
Achieves the highest speed among VIS models.
Attains the best results among single-model methods on YouTube-VIS.
Demonstrates a simpler, faster framework with competitive accuracy.
Abstract
Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video. Recent methods typically develop sophisticated pipelines to tackle this task. Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem. Given a video clip consisting of multiple image frames as input, VisTR outputs the sequence of masks for each instance in the video in order directly. At the core is a new, effective instance sequence matching and segmentation strategy, which supervises and segments instances at the sequence level as a whole. VisTR frames the instance segmentation and tracking in the same perspective of similarity learning, thus considerably simplifying the overall pipeline and is…
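The sequence-level matching idea mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the toy data, and the cost (summed L1 distance between boxes across all frames) are assumptions made for illustration, and a brute-force search over permutations stands in for the Hungarian algorithm that assignment problems of this kind are usually solved with. The point it shows is that each ground-truth instance is matched to one prediction slot for the whole clip, not frame by frame.

```python
from itertools import permutations


def sequence_cost(pred_seq, gt_seq):
    """Sum of per-frame L1 box distances over the whole clip (assumed cost)."""
    return sum(
        sum(abs(p - g) for p, g in zip(pred_box, gt_box))
        for pred_box, gt_box in zip(pred_seq, gt_seq)
    )


def match_instance_sequences(pred_seqs, gt_seqs):
    """Match each ground-truth sequence to a distinct prediction slot.

    Returns (assignment, total_cost), where assignment[j] is the index of
    the prediction sequence matched to ground-truth sequence j.
    Brute force: only practical for a handful of instances.
    """
    best_assignment, best_cost = None, float("inf")
    for candidate in permutations(range(len(pred_seqs)), len(gt_seqs)):
        cost = sum(
            sequence_cost(pred_seqs[i], gt_seqs[j])
            for j, i in enumerate(candidate)
        )
        if cost < best_cost:
            best_assignment, best_cost = candidate, cost
    return best_assignment, best_cost


# Toy clip: 2 frames, 3 prediction slots, 2 ground-truth instances.
# Boxes are hypothetical (cx, cy, w, h) tuples in normalized coordinates.
pred_seqs = [
    [(0.8, 0.8, 0.2, 0.2), (0.8, 0.7, 0.2, 0.2)],  # near ground truth 1
    [(0.5, 0.5, 0.1, 0.1), (0.5, 0.5, 0.1, 0.1)],  # left unmatched
    [(0.2, 0.2, 0.1, 0.1), (0.3, 0.2, 0.1, 0.1)],  # identical to ground truth 0
]
gt_seqs = [
    [(0.2, 0.2, 0.1, 0.1), (0.3, 0.2, 0.1, 0.1)],
    [(0.8, 0.8, 0.2, 0.2), (0.8, 0.8, 0.2, 0.2)],
]

assignment, total_cost = match_instance_sequences(pred_seqs, gt_seqs)
print(assignment, round(total_cost, 3))
```

Because the cost is accumulated over every frame before the assignment is chosen, a prediction slot keeps the same identity throughout the clip, which is what lets tracking fall out of the matching step in this toy setting.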
Taxonomy
Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging
Methods: VisTR
