TReX- Reusing Vision Transformer's Attention for Efficient Xbar-based Computing
Abhishek Moitra, Abhiroop Bhattacharjee, Youngeun Kim, and Priyadarshini Panda

TL;DR
TReX is a framework that reuses attention in Vision Transformers (ViTs) to optimize the tradeoff among accuracy, energy, delay, and area, improving hardware efficiency while keeping accuracy near the baseline (near iso-accuracy).
Contribution
It introduces an attention reuse method for ViTs that balances accuracy against efficiency, addressing the attention-block overhead that prior in-memory computing (IMC) implementations neglected.
Findings
Achieves a 2.3x reduction in EDAP (energy-delay-area product) on Imagenet-1k
Improves TOPS/mm² by 1.86x with minimal accuracy loss
Outperforms state-of-the-art token pruning and weight sharing methods
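The EDAP figure above combines three hardware costs into one number. As a minimal sketch of how such a reduction factor can be computed, assuming EDAP is the plain product of energy, delay, and area: the class `HardwareCost`, the function `edap_reduction`, and every numeric value below are hypothetical placeholders for illustration, not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HardwareCost:
    """Hypothetical container for per-inference hardware costs."""
    energy_j: float   # energy per inference, in joules
    delay_s: float    # latency per inference, in seconds
    area_mm2: float   # chip area, in mm^2

    def edap(self) -> float:
        # Energy-delay-area product: lower is better.
        return self.energy_j * self.delay_s * self.area_mm2


def edap_reduction(baseline: HardwareCost, optimized: HardwareCost) -> float:
    """Return how many times smaller the optimized EDAP is than the baseline."""
    return baseline.edap() / optimized.edap()


# Placeholder values chosen for easy arithmetic; NOT results from the paper.
baseline = HardwareCost(energy_j=2.0, delay_s=4.0, area_mm2=10.0)
optimized = HardwareCost(energy_j=1.0, delay_s=4.0, area_mm2=10.0)
ratio = edap_reduction(baseline, optimized)
print(f"EDAP reduction: {ratio:.2f}x")
```

Because the metric is a product, a saving in any one of the three terms scales the whole figure, which is why a method can trade a small area or accuracy cost for a larger energy or delay gain.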
Abstract
Due to the high computation overhead of Vision Transformers (ViTs), In-memory Computing (IMC) architectures are being researched towards energy-efficient deployment in edge-computing scenarios. Prior works have proposed efficient algorithm-hardware co-design and IMC-architectural improvements to improve the energy-efficiency of IMC-implemented ViTs. However, all prior works have neglected the overhead and co-dependence of attention blocks on the accuracy-energy-delay-area of IMC-implemented ViTs. To this end, we propose TReX, an attention-reuse-driven ViT optimization framework that effectively performs attention reuse in ViT models to achieve optimal accuracy-energy-delay-area tradeoffs. TReX optimally chooses the transformer encoders for attention reuse to achieve near iso-accuracy performance while meeting the user-specified delay requirement. Based on our analysis on the Imagenet-1k…
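The idea of attention reuse described in the abstract can be illustrated with a toy example. This is a minimal sketch under loud assumptions, not the paper's implementation: it assumes that "reuse" means an encoder skips its own query/key computation and applies the softmax attention map carried over from the most recent encoder that computed one. The summary does not say whether the value projection is also skipped, so the sketch keeps it. Layer normalization, multi-head splitting, and the MLP block are omitted, and all names (`attention_map`, `run_encoders`, `reuse`) are hypothetical.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Apply a numerically stable softmax to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def attention_map(x, w_q, w_k):
    """Compute softmax(Q K^T / sqrt(d_k)) for token matrix x."""
    q, k = matmul(x, w_q), matmul(x, w_k)
    k_t = [list(col) for col in zip(*k)]
    scale = 1.0 / math.sqrt(len(w_q[0]))
    scores = [[v * scale for v in row] for row in matmul(q, k_t)]
    return softmax_rows(scores)


def run_encoders(x, layers, reuse):
    """Run simplified encoders; indices in `reuse` reuse the previous attention map.

    Returns the final token matrix and the number of attention maps computed.
    """
    attn, computed = None, 0
    for i, layer in enumerate(layers):
        if i not in reuse or attn is None:
            # Full encoder: compute a fresh attention map (the costly Q/K path).
            attn = attention_map(x, layer["w_q"], layer["w_k"])
            computed += 1
        # Assumption: the value projection is still computed in every encoder.
        v = matmul(x, layer["w_v"])
        mixed = matmul(attn, v)
        # Residual connection around the attention output.
        x = [[a + b for a, b in zip(xr, mr)] for xr, mr in zip(x, mixed)]
    return x, computed


# Toy setup: 2 tokens, embedding size 2, identity projections (placeholders).
identity = [[1.0, 0.0], [0.0, 1.0]]
layers = [{"w_q": identity, "w_k": identity, "w_v": identity} for _ in range(3)]
tokens = [[1.0, 0.0], [0.0, 1.0]]

_, full_count = run_encoders(tokens, layers, reuse=set())
out, reuse_count = run_encoders(tokens, layers, reuse={1, 2})
print(full_count, reuse_count)
```

In this toy setting, marking encoders 1 and 2 for reuse cuts the number of attention-map computations from three to one. Which encoders to mark is the selection problem the abstract attributes to TReX, and the sketch does not attempt it.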
Taxonomy
Topics: CCD and CMOS Imaging Sensors
Methods: Softmax · Attention Is All You Need · Pruning
