TL;DR
Deformable VisTR introduces a spatio-temporal deformable attention mechanism that improves training efficiency and reduces computational cost in video instance segmentation, matching the performance of the original VisTR with about 10 times fewer GPU training hours.
Contribution
We propose Deformable VisTR, a novel transformer-based framework that uses deformable attention for efficient and effective video instance segmentation.
Findings
Achieves computation that is linear in the size of the spatio-temporal feature maps.
Requires 10 times fewer GPU training hours than the original VisTR.
Performs on par with state-of-the-art methods on the YouTube-VIS benchmark.
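The linear-computation finding can be illustrated with a rough count of query-key interactions. The sketch below is only a back-of-the-envelope comparison, not the paper's cost model: the function names and the parameter `k` (number of sampling points per query) are assumptions introduced here for illustration.

```python
def dense_attention_pairs(t: int, h: int, w: int) -> int:
    """Query-key pairs when every spatio-temporal location attends to every other."""
    n = t * h * w  # number of locations in a T x H x W feature map
    return n * n   # quadratic in the feature-map size


def deformable_attention_pairs(t: int, h: int, w: int, k: int) -> int:
    """Query-key pairs when each location attends to only k sampled points.

    k is a hypothetical fixed sampling budget per query, independent of t, h, w.
    """
    return t * h * w * k  # linear in the feature-map size
```

Under these assumptions, doubling the number of frames quadruples the dense count but only doubles the deformable count.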
Abstract
The video instance segmentation (VIS) task requires classifying, segmenting, and tracking object instances over all frames in a video clip. Recently, VisTR has been proposed as an end-to-end transformer-based VIS framework, demonstrating state-of-the-art performance. However, VisTR is slow to converge during training, requiring around 1000 GPU hours due to the high computational cost of its transformer attention module. To improve the training efficiency, we propose Deformable VisTR, which leverages a spatio-temporal deformable attention module that attends only to a small fixed set of key spatio-temporal sampling points around a reference point. This enables Deformable VisTR to achieve computation that is linear in the size of the spatio-temporal feature maps. Moreover, it achieves performance on par with the original VisTR with 10 times fewer GPU training hours. We validate the effectiveness of our…
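The idea of attending to a small fixed set of sampling points around a reference point can be sketched for a single query as follows. This is a simplified illustration, not the authors' implementation: it assumes the offsets and attention logits are given as inputs (in deformable attention they are typically predicted from the query), it uses a single head, and it rounds sampling locations to the nearest grid cell instead of interpolating. The function name `deformable_attend` and all argument names are hypothetical.

```python
import numpy as np


def deformable_attend(features, ref_point, offsets, logits):
    """Aggregate K sampled features around one spatio-temporal reference point.

    features:  array of shape (T, H, W, C)
    ref_point: (t, y, x) reference location of the query
    offsets:   array of shape (K, 3), sampling offsets relative to ref_point
    logits:    array of shape (K,), unnormalised attention scores
    """
    T, H, W, _ = features.shape
    # Sampling locations: reference point plus offsets, snapped to the grid
    # (nearest-neighbour simplification) and clipped to stay inside the map.
    locs = np.rint(np.asarray(ref_point, dtype=float) + np.asarray(offsets, dtype=float))
    locs = np.clip(locs.astype(int), 0, [T - 1, H - 1, W - 1])
    sampled = features[locs[:, 0], locs[:, 1], locs[:, 2]]  # shape (K, C)
    # Softmax over the K sampling points only, not over all T*H*W locations.
    scores = np.asarray(logits, dtype=float)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ sampled  # shape (C,)
```

The cost per query depends only on K, which is why the total cost in this sketch grows linearly with the number of query locations.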
Taxonomy
Methods: VisTR · Deformable Attention Module
