Reversible Vision Transformers
Karttikeya Mangalam, Haoqi Fan, Yanghao Li, Chao-Yuan Wu, Bo Xiong, Christoph Feichtenhofer, Jitendra Malik

TL;DR
Reversible Vision Transformers cut training memory for visual recognition by recomputing activations instead of storing them, which allows larger models and, for deeper models, higher training throughput at roughly the same accuracy.
Contribution
This work introduces reversible variants of Vision Transformer and Multiscale Vision Transformers, giving memory-efficient backbones suited to training on limited hardware.
Findings
Memory footprint reduced by up to 15.5x at roughly identical model complexity and parameter count
Throughput increased by up to 2.3x for deeper models
Comparable accuracy on image classification, object detection and video classification
Abstract
We present Reversible Vision Transformers, a memory efficient architecture design for visual recognition. By decoupling the GPU memory requirement from the depth of the model, Reversible Vision Transformers enable scaling up architectures with efficient memory usage. We adapt two popular models, namely Vision Transformer and Multiscale Vision Transformers, to reversible variants and benchmark extensively across both model sizes and tasks of image classification, object detection and video classification. Reversible Vision Transformers achieve a reduced memory footprint of up to 15.5x at roughly identical model complexity, parameters and accuracy, demonstrating the promise of reversible vision transformers as an efficient backbone for hardware resource limited training regimes. Finally, we find that the additional computational burden of recomputing activations is more than overcome for…
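The memory saving described in the abstract rests on layers whose inputs can be recovered exactly from their outputs, so intermediate activations do not have to be kept for the backward pass. The following is a minimal sketch of that idea using the standard two-stream reversible residual coupling. The functions `f` and `g` are hypothetical stand-ins (in a transformer they would play the role of the attention and MLP sub-blocks), and all names here are illustrative assumptions, not the authors' code.

```python
import math

def f(x):
    # Hypothetical stand-in for the first sub-block (e.g. attention).
    # Any deterministic function works; it does not need to be invertible.
    return [0.5 * math.tanh(v) for v in x]

def g(x):
    # Hypothetical stand-in for the second sub-block (e.g. MLP).
    return [0.5 * math.sin(v) for v in x]

def add(a, b):
    return [p + q for p, q in zip(a, b)]

def sub(a, b):
    return [p - q for p, q in zip(a, b)]

def forward(x1, x2):
    # Two-stream coupling: each stream is updated using only the other one.
    y1 = add(x1, f(x2))
    y2 = add(x2, g(y1))
    return y1, y2

def inverse(y1, y2):
    # Undo the coupling in reverse order, recomputing f and g.
    x2 = sub(y2, g(y1))
    x1 = sub(y1, f(x2))
    return x1, x2

def run_stack(x1, x2, depth):
    # Forward pass through `depth` blocks, keeping only the latest pair.
    for _ in range(depth):
        x1, x2 = forward(x1, x2)
    return x1, x2

def recover_inputs(y1, y2, depth):
    # Walk back through the stack, rebuilding each block's input
    # from its output instead of reading it from a cache.
    for _ in range(depth):
        y1, y2 = inverse(y1, y2)
    return y1, y2

if __name__ == "__main__":
    out1, out2 = run_stack([0.1, -0.4], [0.3, 0.7], depth=8)
    print(recover_inputs(out1, out2, depth=8))
```

Because `recover_inputs` needs only the final pair of streams, the number of stored activations in this sketch does not grow with `depth`. The price is one extra evaluation of `f` and `g` per block, which corresponds to the recomputation cost the abstract mentions. Recovery is exact up to floating-point rounding.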
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Memory and Neural Computing · Advanced Image and Video Retrieval Techniques
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Layer Normalization · Multi-Head Attention · Position-Wise Feed-Forward Layer · Vision Transformer · Adam · Label Smoothing · Softmax
