AIM: Adapting Image Models for Efficient Video Action Recognition

Taojiannan Yang; Yi Zhu; Yusheng Xie; Aston Zhang; Chen Chen; Mu Li

arXiv:2302.03024 · cs.CV · February 7, 2023 · 62 cites

TL;DR

AIM adapts a frozen pre-trained image model for video action recognition by adding lightweight adapters. This gives the model spatiotemporal reasoning ability with far fewer tunable parameters than full finetuning, while keeping accuracy competitive.

Contribution

The paper presents a novel approach to adapt pre-trained image models for video understanding using lightweight adapters, reducing computational cost while maintaining high accuracy.

Findings

01. Achieves competitive or better performance than prior methods on four video action recognition benchmarks.
02. Uses substantially fewer tunable parameters than prior methods.
03. Applies to various pre-trained image models.
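To make "fewer tunable parameters" concrete, here is a back-of-the-envelope count. It assumes a ViT-B-like encoder (12 blocks, width 768, MLP ratio 4) and three bottleneck adapters per block with a bottleneck ratio of 0.25; these sizes, and the helper names `vit_block_params` and `adapter_params`, are illustrative assumptions rather than figures reported on this page. Only the encoder blocks are counted, so patch/position embeddings and the classification head are ignored.

```python
def linear_params(d_in, d_out):
    """Weight matrix plus bias vector of a fully connected layer."""
    return d_in * d_out + d_out


def vit_block_params(d, mlp_ratio=4):
    """Parameters in one standard pre-norm ViT encoder block of width d."""
    layer_norms = 2 * (2 * d)  # two LayerNorms, each with a scale and a shift vector
    attention = linear_params(d, 3 * d) + linear_params(d, d)  # fused QKV + output projection
    mlp = linear_params(d, mlp_ratio * d) + linear_params(mlp_ratio * d, d)
    return layer_norms + attention + mlp


def adapter_params(d, bottleneck_ratio=0.25):
    """Hypothetical bottleneck adapter: down-project to d * ratio, up-project back to d."""
    hidden = int(d * bottleneck_ratio)
    return linear_params(d, hidden) + linear_params(hidden, d)


# Assumed ViT-B-like encoder: 12 blocks of width 768 (embeddings and head ignored).
depth, width = 12, 768
adapters_per_block = 3  # assumption: one each for temporal, spatial and joint adaptation

frozen = depth * vit_block_params(width)
tunable = depth * adapters_per_block * adapter_params(width)

print(f"frozen encoder parameters: {frozen:,}")
print(f"tunable adapter parameters: {tunable:,}")
print(f"tunable / frozen: {tunable / frozen:.1%}")
```

Under these assumptions the adapters hold roughly an eighth as many parameters as the frozen encoder blocks, and only the adapters would be updated during training. Because the adapter count is linear in the bottleneck width, lowering `bottleneck_ratio` shrinks the tunable budget roughly proportionally.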

Abstract

Recent vision transformer based video models mostly follow the "image pre-training then finetuning" paradigm and have achieved great success on multiple video benchmarks. However, fully finetuning such a video model can be computationally expensive and unnecessary, given that pre-trained image transformer models have demonstrated exceptional transferability. In this work, we propose a novel method to Adapt pre-trained Image Models (AIM) for efficient video understanding. By freezing the pre-trained image model and adding a few lightweight Adapters, we introduce spatial adaptation, temporal adaptation and joint adaptation to gradually equip an image model with spatiotemporal reasoning capability. We show that our proposed AIM can achieve competitive or even better performance than prior art with substantially fewer tunable parameters on four video action recognition benchmarks. Thanks…
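The abstract names three adaptation steps but does not spell out their internals. The NumPy sketch below is one plausible reading, not the repository's code. It assumes bottleneck adapters (down-project, GELU, up-project, optional residual skip), that temporal adaptation reuses the frozen self-attention along the frame axis by reshaping tokens, and that joint adaptation runs an adapter in parallel with the frozen MLP. The names `Adapter`, `frozen_attention`, `aim_block`, the 0.25 bottleneck ratio, the 0.5 joint scale, and the zero-initialized up-projection are all hypothetical choices for illustration. The frozen attention is a single-head stand-in with identity Q/K/V projections.

```python
import numpy as np

rng = np.random.default_rng(0)


def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


def layer_norm(x, eps=1e-6):
    # LayerNorm over the feature axis (no learned scale/shift, for brevity)
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)


class Adapter:
    """Hypothetical bottleneck adapter; the only trainable part of the block."""

    def __init__(self, d, ratio=0.25, skip=True):
        h = int(d * ratio)
        self.w_down = rng.normal(0, 0.02, (d, h))
        self.w_up = np.zeros((h, d))  # assumption: zero init, so the adapter starts as a no-op
        self.skip = skip

    def __call__(self, x):
        out = gelu(x @ self.w_down) @ self.w_up
        return x + out if self.skip else out


def frozen_attention(x):
    """Stand-in for a frozen self-attention layer. x: (sequence, batch, d)."""
    scores = np.einsum("sbd,tbd->bst", x, x) / np.sqrt(x.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("bst,tbd->sbd", weights, x)


def frozen_mlp(x, w1, w2):
    return gelu(x @ w1) @ w2


def aim_block(x, num_frames, adapters, mlp_weights, scale=0.5):
    """x: (n_tokens, batch * num_frames, d), i.e. per-frame spatial tokens."""
    n, bt, d = x.shape
    b = bt // num_frames
    # Temporal adaptation: regroup so the frozen attention mixes frames, not patches.
    xt = x.reshape(n, b, num_frames, d).transpose(2, 1, 0, 3).reshape(num_frames, b * n, d)
    xt = adapters["temporal"](frozen_attention(layer_norm(xt)))
    xt = xt.reshape(num_frames, b, n, d).transpose(2, 1, 0, 3).reshape(n, bt, d)
    x = x + xt
    # Spatial adaptation: adapter applied after the frozen spatial attention.
    x = x + adapters["spatial"](frozen_attention(layer_norm(x)))
    # Joint adaptation: scaled adapter in parallel with the frozen MLP.
    xn = layer_norm(x)
    return x + frozen_mlp(xn, *mlp_weights) + scale * adapters["joint"](xn)


# Usage with toy sizes: 5 tokens per frame, 2 clips of 4 frames, width 16.
d, n, b, t = 16, 5, 2, 4
x = rng.normal(size=(n, b * t, d))
adapters = {
    "temporal": Adapter(d, skip=False),
    "spatial": Adapter(d),
    "joint": Adapter(d, skip=False),
}
mlp_weights = (rng.normal(0, 0.02, (d, 4 * d)), rng.normal(0, 0.02, (4 * d, d)))
y = aim_block(x, t, adapters, mlp_weights)
```

With the up-projections at zero, the temporal and joint adapters output zeros and the spatial adapter passes its input through, so the block initially computes exactly the frozen image-model block. Training then moves only the adapter weights away from that starting point.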

Code & Models

Repositories

taoyang1122/adapt-image-models
pytorch

Videos

AIM: Adapting Image Models for Efficient Video Action Recognition · SlidesLive

Taxonomy

Topics: Human Pose and Action Recognition · Multimodal Machine Learning Applications · Diabetic Foot Ulcer Assessment and Management

Methods: Attention Is All You Need · Softmax · Residual Connection · Dense Connections · Linear Layer · Layer Normalization · Multi-Head Attention · Vision Transformer