Deformable VisTR: Spatio temporal deformable attention for video instance segmentation

Sudhir Yarram; Jialian Wu; Pan Ji; Yi Xu; Junsong Yuan

arXiv:2203.06318·cs.CV·March 15, 2022

TL;DR

Deformable VisTR introduces a spatio-temporal deformable attention mechanism to improve training efficiency and reduce computational costs in video instance segmentation, achieving comparable performance with significantly less training time.

Contribution

We propose Deformable VisTR, a novel transformer-based framework that uses deformable attention for efficient and effective video instance segmentation.

Findings

01 Achieves computation that is linear in the size of the spatio-temporal feature maps.

02 Requires 10 times fewer GPU training hours than the original VisTR.

03 Performs on par with state-of-the-art methods on the YouTube-VIS benchmark.
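To make the linear-computation finding concrete, the following is a rough counting sketch, not a measurement from the paper. It assumes dense attention compares every spatio-temporal location with every other location, while deformable attention samples a fixed number `K` of points per query. The function names and the clip size used in the example are hypothetical.

```python
def dense_attention_pairs(T, H, W):
    """Query-key pairs when every location attends to every location."""
    n = T * H * W          # number of spatio-temporal locations
    return n * n           # quadratic in the feature-map size


def deformable_attention_pairs(T, H, W, K):
    """Query-key pairs when each location attends to only K sampled points."""
    return T * H * W * K   # linear in the feature-map size for fixed K


if __name__ == "__main__":
    # Hypothetical clip: 36 frames of 10x15 feature maps, 4 sampling points.
    T, H, W, K = 36, 10, 15, 4
    print("dense:     ", dense_attention_pairs(T, H, W))
    print("deformable:", deformable_attention_pairs(T, H, W, K))
```

Under these assumptions, doubling the spatial width doubles the deformable count but quadruples the dense count.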

Abstract

The video instance segmentation (VIS) task requires classifying, segmenting, and tracking object instances over all frames in a video clip. Recently, VisTR was proposed as an end-to-end transformer-based VIS framework and demonstrated state-of-the-art performance. However, VisTR is slow to converge during training, requiring around 1000 GPU hours due to the high computational cost of its transformer attention module. To improve training efficiency, we propose Deformable VisTR, which leverages a spatio-temporal deformable attention module that attends only to a small, fixed set of key spatio-temporal sampling points around a reference point. This enables Deformable VisTR to achieve computation that is linear in the size of the spatio-temporal feature maps. Moreover, it achieves performance on par with the original VisTR using $10\times$ fewer GPU training hours. We validate the effectiveness of our…
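The sampling idea described in the abstract can be illustrated with a minimal single-query sketch. This is not the authors' implementation. It assumes scalar features, a single head and scale, trilinear interpolation, and offsets and attention logits passed in directly (in a real model they would presumably be predicted from the query). All function and parameter names are hypothetical.

```python
import math


def softmax(logits):
    """Normalize raw scores into attention weights that sum to 1."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def trilinear_sample(volume, t, y, x):
    """Interpolate volume[t][y][x] at a fractional (t, y, x) location.

    Corners that fall outside the volume contribute zero.
    """
    T, H, W = len(volume), len(volume[0]), len(volume[0][0])
    t0, y0, x0 = math.floor(t), math.floor(y), math.floor(x)
    value = 0.0
    for ti in (t0, t0 + 1):
        for yi in (y0, y0 + 1):
            for xi in (x0, x0 + 1):
                if 0 <= ti < T and 0 <= yi < H and 0 <= xi < W:
                    w = (1 - abs(t - ti)) * (1 - abs(y - yi)) * (1 - abs(x - xi))
                    value += w * volume[ti][yi][xi]
    return value


def deformable_attention(volume, reference, offsets, logits):
    """Attend to K sampled points around a reference point.

    volume:    nested list indexed as [frame][row][col] of scalar features
    reference: (t, y, x) reference point of the query
    offsets:   K tuples (dt, dy, dx) added to the reference point
    logits:    K raw attention scores, one per sampling point
    """
    weights = softmax(logits)
    rt, ry, rx = reference
    out = 0.0
    for (dt, dy, dx), w in zip(offsets, weights):
        out += w * trilinear_sample(volume, rt + dt, ry + dy, rx + dx)
    return out


if __name__ == "__main__":
    # Toy 2-frame, 3x3 volume whose value encodes its own coordinates.
    volume = [[[100 * t + 10 * y + x for x in range(3)] for y in range(3)]
              for t in range(2)]
    offsets = [(0, 0, 0), (0.5, 0, 0.5), (1, -1, 0), (0, 0.5, -0.5)]
    print(deformable_attention(volume, (0, 1, 1), offsets, [0.0] * 4))
```

The work per query depends only on the number of sampling points (here four), not on the size of the volume, which is where the linear overall cost comes from.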

Code & Models

Repositories

skrya/defvis
PyTorch · Official

Taxonomy

Methods: VisTR · Deformable Attention Module