TL;DR
This paper introduces LGViT, an early exiting framework for vision transformers that balances efficiency and accuracy, achieving roughly 1.8× speed-up with minimal performance loss on multiple datasets.
Contribution
The paper proposes a new early exiting method for ViTs with heterogeneous heads and a two-stage training scheme, improving inference speed while maintaining accuracy.
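To make the idea of early exiting concrete, here is a minimal sketch of confidence-based exiting at inference time. It is an illustration under stated assumptions, not LGViT's implementation: the summary does not describe the heterogeneous heads or the two-stage training, so the blocks and heads are plain callables, the confidence measure is assumed to be the maximum softmax probability, and the names `early_exit_predict`, `blocks`, `heads`, and the default `threshold` of 0.9 are hypothetical.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = List[float]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def early_exit_predict(
    x: Vector,
    blocks: Sequence[Callable[[Vector], Vector]],
    heads: Dict[int, Callable[[Vector], Vector]],
    threshold: float = 0.9,  # hypothetical value, not taken from the paper
) -> Tuple[int, int]:
    """Run blocks in order and stop at the first sufficiently confident head.

    blocks: stand-ins for transformer layers (assumption: any callable).
    heads:  maps a block index to an internal classifier returning logits.
    Returns (predicted class, index of the block where inference stopped).
    """
    hidden = x
    last = len(blocks) - 1
    for i, block in enumerate(blocks):
        hidden = block(hidden)
        head = heads.get(i)
        if head is None:
            continue  # no internal classifier attached to this block
        probs = softmax(head(hidden))
        confidence = max(probs)  # assumed confidence measure
        # Exit early when confident; the final head always answers.
        if confidence >= threshold or i == last:
            return probs.index(confidence), i
    raise ValueError("the final block must have a classifier head")
```

In this sketch, inputs that a shallow head classifies confidently skip the remaining blocks, which is where the inference-time savings come from. Raising `threshold` trades speed for accuracy, since more inputs then run through deeper blocks.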
Findings
LGViT achieves approximately 1.8× speed-up.
Extensive experiments validate LGViT's effectiveness across multiple ViT backbones.
The method maintains competitive performance with reduced inference time.
Abstract
Recently, the efficient deployment and acceleration of powerful vision transformers (ViTs) on resource-limited edge devices for providing multimedia services have become attractive tasks. Although early exiting is a feasible solution for accelerating inference, most works focus on convolutional neural networks (CNNs) and transformer models in natural language processing (NLP). Moreover, the direct application of early exiting methods to ViTs may result in substantial performance degradation. To tackle this challenge, we systematically investigate the efficacy of early exiting in ViTs and point out that the insufficient feature representations in shallow internal classifiers and the limited ability to capture target semantic information in deep internal classifiers restrict the performance of these methods. We then propose an early exiting framework for general ViTs termed LGViT, which…
Taxonomy
Methods: Early exiting using confidence measures · Focus
