Video Patch Pruning: Efficient Video Instance Segmentation via Early Token Reduction
Patrick Glandorf, Thomas Norrenbrock, Bodo Rosenhahn

TL;DR
This paper introduces Video Patch Pruning (VPP), a novel method that leverages temporal prior knowledge to enable early-stage patch reduction in Vision Transformers for video segmentation, achieving up to 60% sparsity with minimal performance loss.
Contribution
The work presents a fully differentiable temporal mapping module for early patch pruning in ViTs, significantly improving efficiency in dense video prediction tasks.
Findings
Up to 60% patch reduction in dense prediction tasks.
Keeps the accuracy drop below 0.6% on YouTube-VIS 2021.
Outperforms conventional image-based patch pruning in high-sparsity regimes.
Abstract
Vision Transformers (ViTs) have demonstrated state-of-the-art performance in several benchmarks, yet their high computational cost hinders their practical deployment. Patch Pruning offers significant savings, but existing approaches restrict token reduction to deeper layers, leaving early-stage compression unexplored. This limits their potential for holistic efficiency. In this work, we present a novel Video Patch Pruning framework (VPP) that integrates temporal prior knowledge to enable efficient sparsity within early ViT layers. Our approach is motivated by the observation that prior features extracted from deeper layers exhibit strong foreground selectivity. Therefore, we propose a fully differentiable module for temporal mapping to accurately select the most relevant patches in early network stages. Notably, the proposed method enables a patch reduction of up to 60% in dense…
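The selection idea in the abstract can be illustrated with a small sketch. It is not the authors' module: the paper describes a fully differentiable temporal mapping, whereas the code below uses a plain hard top-k over per-patch scores. The names `select_patches`, `prune_tokens`, and `prior_scores` are hypothetical, and the scores are assumed to be foreground saliency values carried over from a deeper layer of the previous frame.

```python
def select_patches(prior_scores, sparsity):
    """Return the sorted indices of the patches to keep.

    prior_scores: one saliency value per patch (assumed to come from
                  deeper-layer features of the previous frame).
    sparsity:     fraction of patches to drop, in [0, 1).
    """
    if not 0.0 <= sparsity < 1.0:
        raise ValueError("sparsity must be in [0, 1)")
    num_patches = len(prior_scores)
    # Keep the remaining fraction, but never fewer than one patch.
    num_keep = max(1, int(round(num_patches * (1.0 - sparsity))))
    # Rank patch indices by score, highest first (hard top-k, not differentiable).
    ranked = sorted(range(num_patches), key=lambda i: prior_scores[i], reverse=True)
    # Restore the original spatial order of the kept patches.
    return sorted(ranked[:num_keep])


def prune_tokens(tokens, keep_indices):
    """Drop every token whose index is not in keep_indices."""
    return [tokens[i] for i in keep_indices]


if __name__ == "__main__":
    # Ten patches with made-up scores; 60% sparsity keeps four of them.
    scores = [0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.6, 0.0, 0.15]
    keep = select_patches(scores, sparsity=0.6)
    print(keep)
    print(prune_tokens(list("abcdefghij"), keep))
```

Applying the pruning before the first transformer blocks, rather than only in deeper layers, is what lets the saving apply to the whole network. A hard top-k like this one would block gradients through the selection, which is presumably why the paper uses a differentiable mapping instead.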
