Exploiting Information Redundancy in Attention Maps for Extreme Quantization of Vision Transformers
Lucas Maisonnave, Karim Haroun, Tom Pegeot

TL;DR
This paper introduces Entropy Attention Maps (EAM), a method that exploits information redundancy in vision transformer attention maps: it freezes low-entropy attention maps and quantizes them to low precision, reducing computation and memory with minimal accuracy loss.
Contribution
The paper uses Shannon entropy to identify low-entropy attention heads in vision transformers, then freezes and quantizes their attention maps, enabling efficient inference with minimal accuracy loss.
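As a rough illustration of what quantizing an attention map to low precision can look like, the sketch below applies plain uniform quantization to a small row-stochastic matrix. The quantizer, the 4-bit width, and the helper names `quantize_attention` and `dequantize` are assumptions made for this example; the summary does not specify the scheme EAM actually uses.

```python
import numpy as np

def quantize_attention(attn, bits=4):
    """Uniformly quantize attention probabilities in [0, 1] to integer codes.

    Hypothetical helper: assumes a uniform grid with 2**bits - 1 steps.
    Returns the integer codes and the scale needed to map them back.
    """
    levels = 2**bits - 1
    codes = np.rint(attn * levels).astype(np.uint8)
    return codes, 1.0 / levels

def dequantize(codes, scale):
    """Map integer codes back to approximate probabilities."""
    return codes.astype(np.float32) * scale

# Toy attention map: one peaked row and one uniform row (each sums to 1).
attn = np.array([[0.72, 0.18, 0.08, 0.02],
                 [0.25, 0.25, 0.25, 0.25]])

codes, scale = quantize_attention(attn, bits=4)
approx = dequantize(codes, scale)
max_err = np.abs(approx - attn).max()  # bounded by half a quantization step
```

A map that has been frozen and stored this way can be reused across inputs instead of being recomputed, which is the kind of saving the contribution describes. Note that dequantized rows are only approximately normalized.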
Findings
EAM achieves up to 20% sparsity in attention maps while maintaining accuracy.
Quantizing low-entropy attention maps reduces computational complexity.
EAM performs well on ImageNet-1k with DeiT and Swin Transformer models.
Abstract
Transformer models rely on Multi-Head Self-Attention (MHSA) mechanisms, where each attention head contributes to the final representation. However, the computational complexity and high memory demands of MHSA hinder their deployment at the edge. In this work, we analyze and exploit information redundancy in attention maps to accelerate model inference. By quantifying the information captured by each attention head using Shannon entropy, our analysis reveals that attention heads with lower entropy, i.e., exhibiting more deterministic behavior, tend to contribute less information, motivating targeted compression strategies. Relying on these insights, we propose Entropy Attention Maps (EAM), a model that freezes the weights of low-entropy attention maps and quantizes these values to low precision to avoid redundant re-computation. Empirical validation on ImageNet-1k shows that EAM…
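The entropy measurement described in the abstract can be sketched as follows. This is a minimal example, assuming attention maps are stored as an array of shape (heads, queries, keys) whose rows sum to 1, and that a head's score is the mean row entropy in bits. The function name `attention_entropy`, the averaging choice, and the toy logits are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_entropy(attn, eps=1e-12):
    """Mean Shannon entropy (bits) of each head's attention rows.

    attn: array of shape (heads, queries, keys); each row sums to 1.
    eps keeps log2 finite for zero probabilities (0 * log 0 is taken as 0).
    """
    safe = np.clip(attn, eps, 1.0)
    row_entropy = -(attn * np.log2(safe)).sum(axis=-1)  # (heads, queries)
    return row_entropy.mean(axis=-1)                    # (heads,)

# Three toy heads over 4 tokens: uniform, moderately peaked, nearly one-hot.
logits = np.stack([
    np.zeros((4, 4)),                    # head 0: uniform attention
    np.tile(np.arange(4.0), (4, 1)),     # head 1: mild preference
    50.0 * np.eye(4),                    # head 2: near-deterministic
])
attn = softmax(logits)

entropies = attention_entropy(attn)
# Heads sorted from lowest to highest entropy; under the abstract's reasoning,
# the first entries would be the candidates for freezing and quantization.
candidates = np.argsort(entropies)
```

Under these assumptions the uniform head reaches the maximum entropy of log2(4) = 2 bits, while the near-one-hot head scores close to zero and is ranked first.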
