Optimizing ViViT Training: Time and Memory Reduction for Action Recognition

Shreyank N Gowda; Anurag Arnab; Jonathan Huang

arXiv:2306.04822·cs.CV·June 9, 2023·1 cites

Open Access

TL;DR

This paper proposes a training strategy for ViViT video transformers that reduces training time and memory usage by freezing the spatial transformer and using a compact adapter, while maintaining or improving accuracy.

Contribution

The authors introduce a novel training method involving freezing the spatial transformer and using an adapter, enabling efficient training of ViViT models without accuracy loss.
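The freezing idea can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the parameter groups, their values, the `sgd_step` helper, and the learning rate are all hypothetical, chosen only to show that a frozen spatial transformer receives no updates while the adapter and temporal transformer do. Because frozen groups need neither gradients nor optimizer state, this is one plausible source of the reported time and memory savings.

```python
# Toy illustration of training with a frozen module (all names and values hypothetical).
# Each "module" is just a list of floats standing in for its weights.
params = {
    "spatial_transformer": [0.5, -0.2, 0.1],  # assumed pretrained, kept frozen
    "adapter": [0.0, 0.0],                    # small trainable module
    "temporal_transformer": [0.3, 0.7],       # trainable
}
frozen = {"spatial_transformer"}


def trainable_count(params, frozen):
    """Number of scalar weights that will actually be updated."""
    return sum(len(v) for name, v in params.items() if name not in frozen)


def sgd_step(params, grads, frozen, lr=0.1):
    """Plain SGD update that passes frozen groups through unchanged."""
    updated = {}
    for name, values in params.items():
        if name in frozen:
            # No gradient is needed (or stored) for a frozen group.
            updated[name] = list(values)
        else:
            updated[name] = [v - lr * g for v, g in zip(values, grads[name])]
    return updated


# Gradients are only supplied for the trainable groups.
grads = {name: [1.0] * len(v) for name, v in params.items() if name not in frozen}
updated = sgd_step(params, grads, frozen)

print(trainable_count(params, frozen), "of",
      sum(len(v) for v in params.values()), "weights are trainable")
```

In a real framework the same effect is usually obtained by excluding the frozen module's parameters from the optimizer and from gradient computation, rather than by filtering names by hand.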

Findings

01 Training costs reduced by approximately 50%

02 Memory consumption decreased significantly

03 Achieved up to 1.79% performance improvement

Abstract

In this paper, we address the challenges posed by the substantial training time and memory consumption associated with video transformers, focusing on the ViViT (Video Vision Transformer) model, in particular the Factorised Encoder version, as our baseline for action recognition tasks. The factorised encoder variant follows the late-fusion approach that is adopted by many state-of-the-art approaches. Despite standing out for its favorable speed/accuracy tradeoffs among the different variants of ViViT, its considerable training time and memory requirements still pose a significant barrier to entry. Our method is designed to lower this barrier and is based on the idea of freezing the spatial transformer during training. Done naively, this leads to a low-accuracy model. But we show that by (1) appropriately initializing the temporal transformer (a module responsible for processing…

Taxonomy

Topics: Human Pose and Action Recognition · Advanced Neural Network Applications · Multimodal Machine Learning Applications

Methods: Spatial Transformer · Adapter