MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation
Zhongzhi Yu, Yonggan Fu, Sicheng Li, Chaojian Li, Yingyan Lin

TL;DR
MIA-Former is a novel vision transformer framework that adaptively adjusts its structure at multiple granularities based on input complexity, reducing computation and enhancing robustness against adversarial attacks.
Contribution
It introduces a multi-grained input-adaptive mechanism for ViTs, enabling dynamic skipping of layers, heads, and tokens, which improves efficiency and robustness.
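The three granularities can be pictured as a nested decision made per layer: first whether to run the layer at all, then which heads to keep, then which tokens to keep. The snippet below is a minimal sketch of that idea only. The names `Decision` and `decide`, the fixed `threshold`, and the rule that the first token is always kept (assumed here to be a class token) are hypothetical; hand-supplied scores stand in for whatever controller the framework actually uses, which this summary does not describe.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Decision:
    run_layer: bool          # coarse: execute or skip the whole layer
    head_mask: List[bool]    # medium: which attention heads stay active
    token_mask: List[bool]   # fine: which tokens are processed


def decide(layer_score: float,
           head_scores: List[float],
           token_scores: List[float],
           threshold: float = 0.5) -> Decision:
    """Turn importance scores into skip decisions (hypothetical rule)."""
    if layer_score < threshold:
        # Skipping the layer makes the finer-grained choices irrelevant.
        return Decision(False,
                        [False] * len(head_scores),
                        [False] * len(token_scores))

    head_mask = [s >= threshold for s in head_scores]
    token_mask = [s >= threshold for s in token_scores]

    # Assumption: a layer that runs keeps at least its best-scoring head.
    if head_mask and not any(head_mask):
        best = max(range(len(head_scores)), key=head_scores.__getitem__)
        head_mask[best] = True

    # Assumption: index 0 is a class token and is never dropped.
    if token_mask:
        token_mask[0] = True

    return Decision(True, head_mask, token_mask)
```

An easy input would produce low scores and therefore skip more layers, heads, and tokens, while a hard input would keep most of the model active.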
Findings
Achieves 20% computation savings with comparable or higher accuracy.
Demonstrates improved robustness against adversarial attacks.
Validates effectiveness through extensive experiments and ablation studies.
Abstract
ViTs are often too computationally expensive to be fitted onto real-world resource-constrained devices, due to (1) their quadratically increased complexity with the number of input tokens and (2) their overparameterized self-attention heads and model depth. In parallel, different images are of varied complexity and their different regions can contain various levels of visual information, indicating that treating all regions/tokens equally in terms of model complexity is unnecessary while such opportunities for trimming down ViTs' complexity have not been fully explored. To this end, we propose a Multi-grained Input-adaptive Vision Transformer framework dubbed MIA-Former that can input-adaptively adjust the structure of ViTs at three coarse-to-fine-grained granularities (i.e., model depth and the number of model heads/tokens). In particular, our MIA-Former adopts a low-cost network…
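To make the two cost sources named in the abstract concrete, the following is a rough multiply-accumulate (MAC) count for a standard transformer block, written as a sketch rather than a measurement of MIA-Former itself. The functions `block_macs` and `model_macs`, the MLP expansion ratio of 4, and the example sizes (12 layers, width 384, 6 heads, 197 tokens) are illustrative assumptions, and the count ignores normalization, softmax, and any controller overhead.

```python
def block_macs(num_tokens, dim, total_heads, active_heads, mlp_ratio=4):
    """Approximate MACs for one transformer block (illustrative cost model)."""
    head_dim = dim // total_heads
    attn_width = active_heads * head_dim              # channels used by active heads
    qkv = 3 * num_tokens * dim * attn_width           # query/key/value projections
    scores = active_heads * num_tokens * num_tokens * head_dim  # QK^T, quadratic in tokens
    mix = scores                                      # attention weights times V
    proj = num_tokens * attn_width * dim              # output projection
    mlp = 2 * num_tokens * dim * (mlp_ratio * dim)    # two feed-forward layers
    return qkv + scores + mix + proj + mlp


def model_macs(layers, dim, total_heads):
    """Sum block costs; each layer is (run_layer, active_heads, num_tokens)."""
    return sum(block_macs(tokens, dim, total_heads, heads)
               for run, heads, tokens in layers if run)


# Hypothetical static model versus a hypothetical input-adapted configuration.
full = model_macs([(True, 6, 197)] * 12, dim=384, total_heads=6)
adapted = model_macs(
    [(True, 6, 197)] * 4        # early layers run in full
    + [(True, 4, 120)] * 6      # middle layers drop heads and tokens
    + [(False, 0, 0)] * 2,      # last two layers are skipped
    dim=384, total_heads=6)
print(f"relative cost: {adapted / full:.2f}")
```

Under this model, dropping tokens shrinks every term and shrinks the attention terms quadratically, dropping heads shrinks only the attention terms, and skipping a layer removes its whole cost.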
Taxonomy
Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Brain Tumor Detection and Classification
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Adam · Layer Normalization · Absolute Position Encodings · Residual Connection · Dropout · Label Smoothing
