RAViT: Resolution-Adaptive Vision Transformer
Martial Guidez, Stefan Duffner, Christophe Garcia

TL;DR
RAViT is a resolution-adaptive vision transformer framework that reduces computational cost through multi-resolution processing and early exits, while maintaining accuracy comparable to that of classical transformers.
Contribution
The paper presents a novel multi-branch, resolution-adaptive vision transformer with an early exit mechanism for efficient image classification.
Findings
Achieves similar accuracy to classical transformers with 30% fewer FLOPs.
Processing several copies of the same image at different resolutions reduces computational cost.
The early exit mechanism allows a dynamic trade-off between accuracy and efficiency at run-time.
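The trade-off in the last finding can be illustrated with a small cost calculation. This is only a sketch under a simplifying assumption that is not taken from the paper: the low-resolution branch always runs, and the full-resolution branch runs only when no early exit occurs. The function name, parameters, and the numbers in the usage note are hypothetical.

```python
def expected_relative_cost(low_res_cost: float,
                           full_res_cost: float,
                           exit_rate: float) -> float:
    """Expected per-image cost, relative to always running the
    full-resolution branch on its own.

    Assumed cost model (not from the paper): the low-resolution branch
    always runs; the full-resolution branch runs only for the fraction
    of images that do not exit early.
    """
    if not 0.0 <= exit_rate <= 1.0:
        raise ValueError("exit_rate must lie in [0, 1]")
    if full_res_cost <= 0.0:
        raise ValueError("full_res_cost must be positive")
    expected = low_res_cost + (1.0 - exit_rate) * full_res_cost
    return expected / full_res_cost
```

With hypothetical values, a low-resolution branch costing a quarter of the full one and half of the images exiting early, `expected_relative_cost(0.25, 1.0, 0.5)` gives 0.75. Under this model, if no image exits early the adaptive pipeline costs more than the single full-resolution branch, which is why the exit rate matters.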
Abstract
Vision transformers have recently made a breakthrough in computer vision, showing excellent performance in terms of precision for numerous applications. However, their computational cost is very high compared to alternative approaches such as Convolutional Neural Networks. To address this problem, we propose a novel framework for image classification called RAViT, based on a multi-branch network that operates on several copies of the same image at different resolutions to reduce the computational cost while preserving the overall accuracy. Furthermore, our framework includes an early exit mechanism that makes our model adaptive and allows choosing the appropriate trade-off between accuracy and computational cost at run-time. For example, in a two-branch architecture, the original image is first resized to reduce its resolution, then a prediction is performed on it using a first…
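The two-branch flow described in the abstract might be sketched as follows. The abstract is truncated here, so several parts are assumptions rather than the authors' method: the exit criterion (a threshold on the top softmax probability), the nearest-neighbor `downsample` helper, and the fact that the second branch runs independently on the full image. The two branch arguments are placeholder callables that return class logits.

```python
import math
from typing import Callable, List, Sequence, Tuple

Image = Sequence[Sequence[float]]          # grayscale image as rows of pixels
Branch = Callable[[Image], List[float]]    # placeholder model returning logits


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def downsample(image: Image, factor: int) -> List[List[float]]:
    """Nearest-neighbor resize: keep every `factor`-th row and column."""
    return [list(row[::factor]) for row in image[::factor]]


def predict_with_early_exit(image: Image,
                            low_res_branch: Branch,
                            full_res_branch: Branch,
                            threshold: float = 0.9,
                            factor: int = 2) -> Tuple[int, str]:
    """Return (predicted class index, name of the branch that answered)."""
    # Step 1: run the first branch on a reduced-resolution copy.
    probs = softmax(low_res_branch(downsample(image, factor)))
    confidence = max(probs)
    # Step 2 (assumed criterion): exit early if the prediction is confident.
    if confidence >= threshold:
        return probs.index(confidence), "low_res"
    # Step 3: otherwise fall back to the full-resolution branch.
    probs = softmax(full_res_branch(image))
    return probs.index(max(probs)), "full_res"
```

Raising `threshold` sends more images to the full-resolution branch, trading computation for accuracy; lowering it does the opposite. This is one simple way to expose the run-time trade-off the abstract mentions.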
Taxonomy
Topics: Advanced Neural Network Applications · CCD and CMOS Imaging Sensors · Advanced Memory and Neural Computing
