Temporal Cluster Assignment for Efficient Real-Time Video Segmentation
Ka-Wai Yung, Felix J. S. Bragman, Jialang Xu, Imanol Luengo, Danail Stoyanov, Evangelos B. Mazomenos

TL;DR
This paper introduces Temporal Cluster Assignment (TCA), a lightweight, training-free method that uses temporal coherence across video frames to improve token clustering, significantly reducing computation while maintaining accuracy in real-time video segmentation.
Contribution
TCA enhances token clustering by exploiting temporal redundancy, enabling more efficient and accurate real-time video segmentation without additional training.
Findings
TCA improves the accuracy-speed trade-off across multiple datasets.
TCA generalizes well to both natural and domain-specific videos.
TCA significantly reduces computational cost while preserving fine-grained detail.
Abstract
Vision Transformers have substantially advanced the capabilities of segmentation models across both image and video domains. Among them, the Swin Transformer stands out for its ability to capture hierarchical, multi-scale representations, making it a popular backbone for segmentation in videos. However, despite its window-attention scheme, it still incurs a high computational cost, especially in larger variants commonly used for dense prediction in videos. This remains a major bottleneck for real-time, resource-constrained applications. Whilst token reduction methods have been proposed to alleviate this, the window-based attention mechanism of Swin requires a fixed number of tokens per window, limiting the applicability of conventional pruning techniques. Meanwhile, training-free token clustering approaches have shown promise in image segmentation while maintaining window consistency.…
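The fixed-tokens-per-window constraint mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: `window_partition` follows the standard Swin-style reshape, while `cluster_2x2_mean` is a hypothetical stand-in for token clustering (plain 2x2 averaging), chosen only because it keeps the token grid regular. The grid size, window sizes, and dropped indices are arbitrary assumptions.

```python
import numpy as np

def window_partition(x, window_size):
    """Split an (H, W, C) token grid into non-overlapping square windows.

    Returns an array of shape (num_windows, window_size * window_size, C).
    Every window holds the same number of tokens, which is what
    window attention relies on.
    """
    H, W, C = x.shape
    if H % window_size or W % window_size:
        raise ValueError("token grid must be divisible by the window size")
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size * window_size, C)

def cluster_2x2_mean(x):
    """Hypothetical stand-in for token clustering: average each 2x2 block.

    The output is still a regular (H/2, W/2, C) grid, so it can be
    partitioned into windows again.
    """
    H, W, C = x.shape
    return x.reshape(H // 2, 2, W // 2, 2, C).mean(axis=(1, 3))

rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 8, 4))      # 8x8 grid of 4-dim tokens

windows = window_partition(tokens, 4)    # 4 windows of 16 tokens each

# Conventional pruning drops arbitrary tokens, leaving a ragged set
# (61 tokens here) that no longer forms a grid of equal-sized windows.
keep = np.ones(64, dtype=bool)
keep[[3, 17, 42]] = False
pruned = tokens.reshape(64, 4)[keep]

# Grid-preserving clustering reduces the token count 4x and still partitions.
clustered = cluster_2x2_mean(tokens)
clustered_windows = window_partition(clustered, 2)

print(windows.shape, pruned.shape, clustered.shape, clustered_windows.shape)
```

In this sketch the pruned set has 61 tokens, which cannot be split evenly into windows, whereas the clustered grid still partitions cleanly. How TCA itself assigns and reuses clusters across frames is not specified in the text above, so it is not modelled here.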
Taxonomy
Topics: Video Analysis and Summarization · Advanced Image and Video Retrieval Techniques · Advanced Vision and Imaging
