VA-RED$^2$: Video Adaptive Redundancy Reduction
Bowen Pan, Rameswar Panda, Camilo Fosco, Chung-Ching Lin, Alex Andonian, Yue Meng, Kate Saenko, Aude Oliva, Rogerio Feris

TL;DR
VA-RED$^2$ is an adaptive framework that reduces computational redundancy in video deep learning models by selectively computing features based on input content, achieving significant efficiency gains without performance loss.
Contribution
It introduces an input-dependent, differentiable policy for adaptive feature computation in videos, effectively reducing redundancy and computational cost.
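As a rough illustration of what an input-dependent, differentiable policy can look like, the sketch below samples a per-input computation ratio with a Gumbel-softmax relaxation. The relaxation itself, the candidate ratios in `RATIOS`, the toy linear policy head, and all function names are assumptions made for illustration; this summary does not specify how the policy is implemented.

```python
import math
import random

# Hypothetical candidate ratios: fraction of features (channels or frames)
# that would be computed in full for a given input.
RATIOS = (1.0, 0.5, 0.25)


def policy_logits(features, weights):
    """Toy linear policy head: one logit per candidate ratio.

    `features` stands in for pooled input features; `weights` has one row
    per entry of RATIOS. Both are placeholders, not the paper's design.
    """
    return [sum(w * f for w, f in zip(row, features)) for row in weights]


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Return (soft probabilities, index of the hard choice).

    Adds Gumbel(0, 1) noise to the logits and applies a temperature-scaled
    softmax. In an autograd framework the soft probabilities would carry
    gradients while the hard index selects the branch to execute.
    """
    noisy = []
    for logit in logits:
        u = rng.uniform(1e-9, 1.0 - 1e-9)      # avoid log(0)
        gumbel = -math.log(-math.log(u))
        noisy.append((logit + gumbel) / tau)
    peak = max(noisy)                          # subtract max for stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    soft = [e / total for e in exps]
    hard = max(range(len(soft)), key=soft.__getitem__)
    return soft, hard


def channels_to_compute(num_channels, ratio):
    """Number of channels computed in full under the chosen ratio."""
    return max(1, int(round(num_channels * ratio)))


# Usage: pick a ratio for one input and translate it into a channel budget.
features = [0.2, -0.1]                          # placeholder pooled features
weights = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]  # placeholder parameters
soft, hard = gumbel_softmax(policy_logits(features, weights), tau=0.5)
budget = channels_to_compute(64, RATIOS[hard])
```

Lowering `tau` pushes the soft probabilities toward a one-hot choice, which is the usual trade-off between a faithful discrete decision and smooth gradients in this kind of relaxation.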
Findings
Achieves a 20-40% reduction in FLOPs compared to state-of-the-art methods.
Maintains original model performance despite computational reduction.
Demonstrates effectiveness across multiple datasets and tasks.
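To make the FLOPs figure concrete, the following back-of-the-envelope sketch shows how skipping part of a layer's output channels translates into a relative saving. The layer dimensions and the number of channels kept are hypothetical, and the count ignores the cost of the policy itself and of any cheap operation used to fill in the skipped features, so it is not the paper's own accounting.

```python
def conv3d_macs(c_in, c_out, kernel, out_shape):
    """Multiply-accumulates of a dense 3D convolution (bias ignored)."""
    k_t, k_h, k_w = kernel
    t, h, w = out_shape
    return c_in * c_out * k_t * k_h * k_w * t * h * w


def relative_saving(full_macs, reduced_macs):
    """Fraction of the full cost that is avoided."""
    return 1.0 - reduced_macs / full_macs


# Hypothetical layer: 64 -> 64 channels, 3x3x3 kernel, 8x56x56 output volume.
full = conv3d_macs(64, 64, (3, 3, 3), (8, 56, 56))

# Assumption: for this input, only 48 of the 64 output channels are computed.
reduced = conv3d_macs(64, 48, (3, 3, 3), (8, 56, 56))

saving = relative_saving(full, reduced)  # 1 - 48/64
```

Because the cost of a dense convolution is linear in the number of output channels, the saving for a single layer equals the fraction of channels skipped; a whole-network figure would average this over layers and inputs.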
Abstract
Performing inference on deep learning models for videos remains a challenge due to the large amount of computational resources required to achieve robust recognition. An inherent property of real-world videos is the high correlation of information across frames, which can translate into redundancy in either temporal or spatial feature maps of the models, or both. The type of redundant features depends on the dynamics and type of events in the video: static videos have more temporal redundancy while videos focusing on objects tend to have more channel redundancy. Here we present a redundancy reduction framework, termed VA-RED$^2$, which is input-dependent. Specifically, our VA-RED$^2$ framework uses an input-dependent policy to decide how many features need to be computed for temporal and channel dimensions. To keep the capacity of the original model, after fully computing the necessary…
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning
