Dynamic and Compressive Adaptation of Transformers From Images to Videos

Guozhen Zhang; Jingyu Liu; Shengming Cao; Xiaotong Zhao; Kevin Zhao; Kai Ma; Limin Wang

arXiv:2408.06840 · cs.CV · August 15, 2024

TL;DR

This paper introduces InTI, a method that adaptively compresses video tokens when adapting image-pretrained Vision Transformers to video. It reaches 87.1% top-1 accuracy on Kinetics-400 while using 37.5% fewer GFLOPs than naive adaptation.

Contribution

InTI is a compressive image-to-video adaptation approach built on dynamic Inter-frame Token Interpolation: tokens at identical positions in neighboring frames are merged with adaptively generated weights, which reduces computation without sacrificing performance.

Findings

01. Achieves 87.1% top-1 accuracy on Kinetics-400.

02. Reduces GFLOPs by 37.5% compared to naive adaptation.

03. Maintains strong performance with additional temporal modules.

Abstract

Recently, the remarkable success of pre-trained Vision Transformers (ViTs) from image-text matching has sparked an interest in image-to-video adaptation. However, most current approaches retain the full forward pass for each frame, leading to a high computation overhead for processing entire videos. In this paper, we present InTI, a novel approach for compressive image-to-video adaptation using dynamic Inter-frame Token Interpolation. InTI aims to softly preserve the informative tokens without disrupting their coherent spatiotemporal structure. Specifically, each token pair at identical positions within neighbor frames is linearly aggregated into a new token, where the aggregation weights are generated by a multi-scale context-aware network. In this way, the information of neighbor frames can be adaptively compressed in a point-by-point manner, thereby effectively reducing the number of…
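
The point-by-point aggregation described in the abstract can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation. The names `placeholder_weight`, `interpolate_pair`, and `compress_video` are hypothetical, and the weight function is a simple energy-based stand-in for the multi-scale context-aware network. The sketch also assumes that frames are merged in non-overlapping neighbor pairs and that the two weights for each token pair sum to one.

```python
from typing import List

Token = List[float]   # one token embedding
Frame = List[Token]   # all tokens of one frame, in a fixed spatial order


def placeholder_weight(a: Token, b: Token, eps: float = 1e-8) -> float:
    """Hypothetical stand-in for the learned weight network.

    Returns a weight in (0, 1) that favors the token with the larger
    squared norm. The paper generates weights with a multi-scale
    context-aware network instead; this rule is only for illustration.
    """
    energy_a = sum(x * x for x in a)
    energy_b = sum(x * x for x in b)
    return (energy_a + eps) / (energy_a + energy_b + 2.0 * eps)


def interpolate_pair(frame_a: Frame, frame_b: Frame) -> Frame:
    """Merge two neighbor frames token by token at identical positions."""
    if len(frame_a) != len(frame_b):
        raise ValueError("neighbor frames must have the same number of tokens")
    merged: Frame = []
    for a, b in zip(frame_a, frame_b):
        w = placeholder_weight(a, b)
        # Linear aggregation: new_token = w * a + (1 - w) * b
        merged.append([w * x + (1.0 - w) * y for x, y in zip(a, b)])
    return merged


def compress_video(frames: List[Frame]) -> List[Frame]:
    """Merge frames (0, 1), (2, 3), ... so the frame count is halved."""
    if len(frames) % 2 != 0:
        raise ValueError("expected an even number of frames")
    return [
        interpolate_pair(frames[i], frames[i + 1])
        for i in range(0, len(frames), 2)
    ]


if __name__ == "__main__":
    # Toy video: 4 frames, 2 tokens per frame, embedding dimension 2.
    video = [
        [[1.0, 0.0], [0.0, 2.0]],
        [[1.0, 0.0], [0.0, 2.0]],
        [[3.0, 0.0], [0.0, 0.0]],
        [[0.0, 0.0], [0.0, 0.0]],
    ]
    compressed = compress_video(video)
    print(len(video), "frames ->", len(compressed), "frames")
```

Because each output token stays at its original spatial position, the token grid keeps its layout and only the temporal length shrinks. In this sketch that means half as many tokens overall.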

Taxonomy

Topics: Industrial Vision Systems and Defect Detection · Generative Adversarial Networks and Image Synthesis · Image and Signal Denoising Methods