End-to-End Video Instance Segmentation with Transformers

Yuqing Wang; Zhaoliang Xu; Xinlong Wang; Chunhua Shen; Baoshan Cheng; Hao Shen; Huaxia Xia

arXiv:2011.14503·cs.CV·October 11, 2021·66 cites

TL;DR

VisTR introduces an end-to-end Transformer-based framework for video instance segmentation, simplifying the pipeline and achieving state-of-the-art speed and competitive accuracy on the YouTube-VIS dataset.

Contribution

It presents the first simple, fast, end-to-end Transformer-based approach for VIS, framing segmentation and tracking as a sequence prediction problem.

Findings

01 Achieves the highest speed among VIS models.

02 Attains the best results among single-model methods on YouTube-VIS.

03 Demonstrates a simpler, faster framework with competitive accuracy.

Abstract

Video instance segmentation (VIS) is the task that requires simultaneously classifying, segmenting and tracking object instances of interest in video. Recent methods typically develop sophisticated pipelines to tackle this task. Here, we propose a new video instance segmentation framework built upon Transformers, termed VisTR, which views the VIS task as a direct end-to-end parallel sequence decoding/prediction problem. Given a video clip consisting of multiple image frames as input, VisTR outputs the sequence of masks for each instance in the video in order directly. At the core is a new, effective instance sequence matching and segmentation strategy, which supervises and segments instances at the sequence level as a whole. VisTR frames the instance segmentation and tracking in the same perspective of similarity learning, thus considerably simplifying the overall pipeline and is…
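
The sequence-level matching idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the cost function (a per-frame absolute difference summed over the clip), the scalar per-frame values, and the names `sequence_cost` and `match_sequences` are all assumptions made for illustration, and the brute-force search stands in for whatever assignment solver a real system would use. The point it shows is that one assignment is chosen per instance for the whole clip rather than per frame, so a matched prediction slot follows the same instance across all frames.

```python
from itertools import permutations

def sequence_cost(pred_seq, gt_seq):
    # Hypothetical cost: per-frame absolute difference, summed over the clip.
    return sum(abs(p - g) for p, g in zip(pred_seq, gt_seq))

def match_sequences(preds, gts):
    # Brute-force search for the one-to-one assignment of prediction slots
    # to ground-truth instances with the lowest total sequence-level cost.
    # assignment[i] is the prediction index matched to ground truth i.
    best, best_cost = None, float("inf")
    for perm in permutations(range(len(preds)), len(gts)):
        cost = sum(sequence_cost(preds[p], gts[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best, best_cost = perm, cost
    return best, best_cost

# Toy data: each sequence holds one scalar per frame (e.g. a box centre).
preds = [[0.9, 0.8, 0.7], [0.1, 0.2, 0.3], [0.5, 0.5, 0.5]]
gts = [[0.1, 0.2, 0.35], [0.9, 0.8, 0.75]]
assignment, total = match_sequences(preds, gts)
print(assignment, round(total, 3))
```

Brute force is only practical for a handful of instances. It is used here to keep the sketch dependency-free.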

Taxonomy

Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging

Methods: VisTR