AIM: Adapting Image Models for Efficient Video Action Recognition
Taojiannan Yang, Yi Zhu, Yusheng Xie, Aston Zhang, Chen Chen, Mu Li

TL;DR
AIM adapts frozen pre-trained image models to video action recognition by adding lightweight adapters, giving them spatiotemporal reasoning with few tunable parameters and competitive accuracy.
Contribution
The paper freezes a pre-trained image model and adds lightweight Adapters for spatial, temporal and joint adaptation, avoiding the cost of full finetuning while keeping accuracy competitive.
Findings
Achieves competitive or better performance than prior methods on four video action recognition benchmarks.
Uses substantially fewer tunable parameters than prior methods.
Applicable to various pre-trained image models.
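To give a feel for why adapters keep the tunable parameter count low, the following is a rough back-of-the-envelope sketch, not a figure from the paper. It assumes a ViT-B-like backbone (width 768, 12 blocks, MLP ratio 4), a hypothetical adapter bottleneck of 192, and three adapters per block (one each for spatial, temporal and joint adaptation). All of these sizes and the helper names are illustrative assumptions.

```python
def linear_params(d_in, d_out):
    """Weights plus biases of one fully connected layer."""
    return d_in * d_out + d_out


def adapter_params(dim, bottleneck):
    """A bottleneck adapter: one down-projection and one up-projection."""
    return linear_params(dim, bottleneck) + linear_params(bottleneck, dim)


def transformer_block_params(dim, mlp_ratio=4):
    """Approximate size of one ViT block (attention, MLP, two LayerNorms)."""
    attention = linear_params(dim, 3 * dim) + linear_params(dim, dim)
    mlp = linear_params(dim, mlp_ratio * dim) + linear_params(mlp_ratio * dim, dim)
    layer_norms = 2 * (2 * dim)  # scale and shift for each of two LayerNorms
    return attention + mlp + layer_norms


# Assumed sizes (not taken from the text above).
DIM = 768                # ViT-B hidden width
DEPTH = 12               # number of transformer blocks
BOTTLENECK = 192         # hypothetical adapter bottleneck width
ADAPTERS_PER_BLOCK = 3   # spatial, temporal and joint adaptation

frozen = DEPTH * transformer_block_params(DIM)
tunable = DEPTH * ADAPTERS_PER_BLOCK * adapter_params(DIM, BOTTLENECK)

print(f"frozen block parameters:    {frozen:,}")
print(f"tunable adapter parameters: {tunable:,}")
print(f"tunable / frozen:           {tunable / frozen:.3f}")
```

Under these assumed sizes the adapters amount to roughly one eighth of the frozen block parameters. The count ignores patch embedding, positional embeddings and the classification head, so it is only an order-of-magnitude illustration.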
Abstract
Recent vision transformer based video models mostly follow the "image pre-training then finetuning" paradigm and have achieved great success on multiple video benchmarks. However, fully finetuning such a video model could be computationally expensive and unnecessary, given that pre-trained image transformer models have demonstrated exceptional transferability. In this work, we propose a novel method to Adapt pre-trained Image Models (AIM) for efficient video understanding. By freezing the pre-trained image model and adding a few lightweight Adapters, we introduce spatial adaptation, temporal adaptation and joint adaptation to gradually equip an image model with spatiotemporal reasoning capability. We show that our proposed AIM can achieve competitive or even better performance than prior arts with substantially fewer tunable parameters on four video action recognition benchmarks. Thanks…
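As a minimal sketch of what "freezing the pre-trained image model and adding a few lightweight Adapters" can look like, the pure-Python example below applies a generic bottleneck adapter to the output of a frozen layer for a single token. The class name `Adapter`, the stand-in `frozen_layer`, the GELU activation, the `skip` flag and the zero-initialized up-projection are all assumptions chosen for illustration; the abstract does not specify these details.

```python
import math
import random


def gelu(x):
    """Exact GELU activation for a single float."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def linear(x, weight, bias):
    """Apply a dense layer; weight holds d_out rows of d_in floats."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


class Adapter:
    """Hypothetical bottleneck adapter: down-project, GELU, up-project."""

    def __init__(self, dim, bottleneck, rng):
        self.w_down = [[rng.gauss(0.0, 0.02) for _ in range(dim)]
                       for _ in range(bottleneck)]
        self.b_down = [0.0] * bottleneck
        # Assumption: zero-initialize the up-projection so that, with the
        # skip connection, the adapter starts out as an identity mapping.
        self.w_up = [[0.0] * bottleneck for _ in range(dim)]
        self.b_up = [0.0] * dim

    def __call__(self, x, skip=True):
        hidden = [gelu(h) for h in linear(x, self.w_down, self.b_down)]
        out = linear(hidden, self.w_up, self.b_up)
        return [o + v for o, v in zip(out, x)] if skip else out


def frozen_layer(x):
    """Stand-in for a frozen pre-trained sublayer; its weights never change."""
    return [2.0 * v for v in x]


rng = random.Random(0)
adapter = Adapter(dim=4, bottleneck=2, rng=rng)  # only these weights would be trained
token = [0.5, -1.0, 2.0, 0.0]
adapted = adapter(frozen_layer(token))
print(adapted)
```

Because the up-projection starts at zero in this sketch, the adapted output initially equals the frozen layer's output, so training begins from the behavior of the pre-trained image model and only the adapter weights need gradients.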
Taxonomy
Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Diabetic Foot Ulcer Assessment and Management
Methods: Attention Is All You Need · Softmax · Residual Connection · Dense Connections · Linear Layer · Layer Normalization · Multi-Head Attention · Vision Transformer
