Tag-Based Attention Guided Bottom-Up Approach for Video Instance Segmentation
Jyoti Kini, Mubarak Shah

TL;DR
This paper introduces a novel end-to-end bottom-up video instance segmentation method that uses tag assignment and a spatio-temporal tagging loss, processing entire video clips as 3D volumes for improved temporal consistency and efficiency.
Contribution
It proposes a new spatio-temporal tagging loss and a tag-based attention module for video instance segmentation, enabling end-to-end training and better temporal propagation.
Findings
Achieves competitive results on YouTube-VIS and DAVIS-19 datasets.
Offers a more efficient, end-to-end approach compared to multi-stage methods.
Demonstrates effective separation and tracking of object instances across videos.
Abstract
Video Instance Segmentation is a fundamental computer vision task that deals with segmenting and tracking object instances across a video sequence. Most existing methods typically accomplish this task by employing a multi-stage top-down approach that usually involves separate networks to detect and segment objects in each frame, followed by associating these detections in consecutive frames using a learned tracking head. In this work, however, we introduce a simple end-to-end trainable bottom-up approach to achieve instance mask predictions at the pixel-level granularity, instead of the typical region-proposals-based approach. Unlike contemporary frame-based models, our network pipeline processes an input video clip as a single 3D volume to incorporate temporal information. The central idea of our formulation is to solve the video instance segmentation task as a tag assignment problem,…
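The tag-assignment idea in the abstract can be illustrated with a generic pull/push tagging loss in the style of associative embeddings. This is a minimal sketch under stated assumptions, not the paper's spatio-temporal tagging loss: the function name `tagging_loss`, the scalar tags, the `margin` parameter, and the convention that instance id 0 means background are all hypothetical choices made for illustration. The clip is treated as one flattened 3D volume (frames × height × width), so voxels of the same instance are grouped regardless of which frame they come from.

```python
from collections import defaultdict


def tagging_loss(tags, instance_ids, margin=1.0):
    """Generic pull/push tagging loss over one flattened (T*H*W) video volume.

    tags: predicted scalar tag per voxel (list of floats).
    instance_ids: ground-truth instance id per voxel; 0 is treated as
        background and ignored (an assumption of this sketch).
    """
    # Collect predicted tags per ground-truth instance across all frames.
    groups = defaultdict(list)
    for tag, inst in zip(tags, instance_ids):
        if inst != 0:
            groups[inst].append(tag)
    if not groups:
        return 0.0

    means = {inst: sum(v) / len(v) for inst, v in groups.items()}

    # Pull term: voxels of one instance, in any frame, should share a tag.
    pull = sum(
        sum((t - means[inst]) ** 2 for t in v) / len(v)
        for inst, v in groups.items()
    ) / len(groups)

    # Push term: mean tags of different instances should differ by >= margin.
    ids = sorted(means)
    pairs = [(a, b) for i, a in enumerate(ids) for b in ids[i + 1:]]
    push = 0.0
    if pairs:
        push = sum(
            max(0.0, margin - abs(means[a] - means[b])) ** 2 for a, b in pairs
        ) / len(pairs)

    return pull + push


# Two frames of 2x2 pixels, flattened frame by frame; 0 marks background.
ids = [1, 1, 2, 0, 1, 1, 2, 2]
separated_tags = [0.0, 0.0, 2.0, 9.0, 0.0, 0.0, 2.0, 2.0]
collapsed_tags = [0.5] * 8
```

With `separated_tags`, each instance keeps one tag in both frames and the two instance means are farther apart than the margin, so both terms vanish. With `collapsed_tags`, every voxel has the same tag, so only the push term contributes.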
Taxonomy
Topics: Visual Attention and Saliency Detection · Video Surveillance and Tracking Methods · Advanced Image and Video Retrieval Techniques
Methods: Contrastive Language-Image Pre-training
