A Simple Video Segmenter by Tracking Objects Along Axial Trajectories
Ju He, Qihang Yu, Inkyu Shin, Xueqing Deng, Alan Yuille, Xiaohui Shen, Liang-Chieh Chen

TL;DR
Axial-VS introduces a simple, efficient framework for video segmentation that tracks objects along axial trajectories, improving temporal consistency and outperforming existing methods on benchmarks.
Contribution
The paper proposes axial-trajectory attention, which improves the temporal consistency and computational efficiency of clip-level video segmentation.
Findings
Achieves state-of-the-art results on video segmentation benchmarks.
Reduces computational complexity compared to traditional attention methods.
Effectively maintains object tracking across video clips.
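To give a rough sense of why restricting attention to axes lowers cost, the sketch below counts query-key pairs for full space-time self-attention versus a generic axial factorisation in which each token attends only along its own height, width, and time axes. This is an illustrative assumption, not the paper's exact axial-trajectory attention; the function names and the example feature-map size are hypothetical.

```python
def full_attention_pairs(t: int, h: int, w: int) -> int:
    """Query-key pairs when every token attends to every token in the clip."""
    n = t * h * w  # total number of tokens in the clip
    return n * n   # quadratic in the number of tokens


def axial_attention_pairs(t: int, h: int, w: int) -> int:
    """Query-key pairs when each token attends along one axis at a time.

    Assumption: three separate passes (height, width, time), so each token
    compares against h + w + t keys in total instead of t * h * w.
    """
    n = t * h * w
    return n * (h + w + t)


if __name__ == "__main__":
    # Hypothetical clip: 2 frames of 64 x 64 feature maps.
    t, h, w = 2, 64, 64
    full = full_attention_pairs(t, h, w)
    axial = axial_attention_pairs(t, h, w)
    print(f"full:  {full:,} pairs")
    print(f"axial: {axial:,} pairs")
    print(f"ratio: {full / axial:.1f}x")
```

Under these assumptions the pair count grows with the sum of the axis lengths rather than their product, which is the general reason axis-wise attention fits in memory at resolutions where full attention does not.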
Abstract
Video segmentation requires consistently segmenting and tracking objects over time. Due to the quadratic dependency on input size, directly applying self-attention to video segmentation with high-resolution input features poses significant challenges, often leading to insufficient GPU memory capacity. Consequently, modern video segmenters either extend an image segmenter without incorporating any temporal attention or resort to window space-time attention in a naive manner. In this work, we present Axial-VS, a general and simple framework that enhances video segmenters by tracking objects along axial trajectories. The framework tackles video segmentation through two sub-tasks: short-term within-clip segmentation and long-term cross-clip tracking. In the first step, Axial-VS augments an off-the-shelf clip-level video segmenter with the proposed axial-trajectory attention, sequentially…
Taxonomy
Topics: Video Surveillance and Tracking Methods · Visual Attention and Saliency Detection · Advanced Image and Video Retrieval Techniques
Methods: High-resolution input · Contrastive Language-Image Pre-training
