MIA-Former: Efficient and Robust Vision Transformers via Multi-grained Input-Adaptation

Zhongzhi Yu; Yonggan Fu; Sicheng Li; Chaojian Li; Yingyan Lin

arXiv:2112.11542 · cs.CV · December 23, 2021

TL;DR

MIA-Former is a novel vision transformer framework that adaptively adjusts its structure at multiple granularities based on input complexity, reducing computation and enhancing robustness against adversarial attacks.

Contribution

It introduces a multi-grained input-adaptive mechanism for ViTs, enabling dynamic skipping of layers, heads, and tokens, which improves efficiency and robustness.
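
The three granularities named above (layers, heads, tokens) can be pictured as nested skip decisions made per input. The following is a minimal sketch under stated assumptions, not the paper's implementation: `LayerDecision` and `active_counts` are hypothetical names introduced only for illustration, and the masks are hand-written rather than predicted by any network.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LayerDecision:
    """Hypothetical per-layer skip decision for one input (not from the paper)."""
    run_layer: bool         # coarse grain: execute this layer at all?
    head_mask: List[bool]   # medium grain: which attention heads stay active
    token_mask: List[bool]  # fine grain: which tokens are processed


def active_counts(decisions: List[LayerDecision]) -> List[Tuple[int, int]]:
    """Return (heads_used, tokens_used) per layer; a skipped layer uses (0, 0)."""
    counts = []
    for d in decisions:
        if not d.run_layer:
            counts.append((0, 0))
        else:
            counts.append((sum(d.head_mask), sum(d.token_mask)))
    return counts


# Hand-written decisions for a toy 3-layer, 3-head, 4-token model.
decisions = [
    LayerDecision(True, [True, True, True], [True] * 4),           # run everything
    LayerDecision(False, [True, True, True], [True] * 4),          # skip whole layer
    LayerDecision(True, [True, False, True], [True, False, True, False]),
]
print(active_counts(decisions))
```

The point of the sketch is only the nesting: a layer-level skip overrides the head and token masks beneath it, while head and token masks trim work inside a layer that does run.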

Findings

01. Achieves 20% computation savings with comparable or higher accuracy.

02. Demonstrates improved robustness against adversarial attacks.

03. Validates effectiveness through extensive experiments and ablation studies.

Abstract

ViTs are often too computationally expensive to be fitted onto real-world resource-constrained devices, due to (1) their quadratically increased complexity with the number of input tokens and (2) their overparameterized self-attention heads and model depth. In parallel, different images are of varied complexity and their different regions can contain various levels of visual information, indicating that treating all regions/tokens equally in terms of model complexity is unnecessary while such opportunities for trimming down ViTs' complexity have not been fully explored. To this end, we propose a Multi-grained Input-adaptive Vision Transformer framework dubbed MIA-Former that can input-adaptively adjust the structure of ViTs at three coarse-to-fine-grained granularities (i.e., model depth and the number of model heads/tokens). In particular, our MIA-Former adopts a low-cost network…
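
To make the two cost sources in the abstract concrete, the sketch below uses a common back-of-the-envelope count of multiply-accumulates (MACs) for a standard transformer block: the attention term grows with the square of the token count, while the projection and MLP terms grow linearly. Everything here is an assumption for illustration: the function names, the model sizes, and the skipping pattern are made up and do not come from the paper, and biases, normalization, and softmax are ignored.

```python
def attention_macs(n_tokens: int, d_attn: int) -> int:
    """MACs for QK^T plus the attention-weighted sum over V: 2 * N^2 * d."""
    return 2 * n_tokens * n_tokens * d_attn


def block_macs(n_tokens: int, dim: int, heads_used: int, heads_total: int,
               mlp_ratio: int = 4) -> int:
    """Approximate MACs of one transformer block under the assumptions above."""
    d_attn = dim * heads_used // heads_total    # width kept after head skipping
    proj = 4 * n_tokens * dim * d_attn          # Q, K, V and output projections
    mlp = 2 * mlp_ratio * n_tokens * dim * dim  # two linear layers in the MLP
    return proj + attention_macs(n_tokens, d_attn) + mlp


def model_macs(layers) -> int:
    """Sum block costs; each entry is (run_layer, n_tokens, heads_used)."""
    return sum(block_macs(n, DIM, h, HEADS) for run, n, h in layers if run)


# Illustrative sizes only (not taken from the source).
DIM, HEADS, DEPTH, TOKENS = 384, 6, 12, 196

full = model_macs([(True, TOKENS, HEADS)] * DEPTH)
# Made-up adaptive configuration: one layer skipped, five layers run
# with half the tokens and half the heads.
adapted = model_macs([(True, TOKENS, HEADS)] * 6
                     + [(False, TOKENS, HEADS)]
                     + [(True, TOKENS // 2, HEADS // 2)] * 5)
print(f"adapted / full = {adapted / full:.2f}")
```

Halving the token count cuts the attention term by a factor of four but the linear terms only by two, which is why token-level, head-level, and layer-level skipping each remove a different part of the cost.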

Taxonomy

Topics: Advanced Neural Network Applications · Adversarial Robustness in Machine Learning · Brain Tumor Detection and Classification

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Adam · Layer Normalization · Absolute Position Encodings · Residual Connection · Dropout · Label Smoothing