LL-ViT: Edge Deployable Vision Transformers with Look Up Table Neurons
Shashank Nag, Alan T.L. Bacellar, Zachary Susskind, Anshul Jha, Logan Liberty, Aishwarya Sivakumar, Eugene B. John, Krishnan Kailas, Priscila M.V. Lima, Neeraja J. Yadwadkar, Felipe M.G. Franca, Lizy K. John

TL;DR
This paper introduces LL-ViT, an edge-oriented vision transformer that integrates layers of LUT neurons into the transformer and is paired with an FPGA accelerator, reaching 95.5% accuracy on CIFAR-10 with over 60% fewer model weights, fewer computations, and lower energy consumption.
Contribution
The paper presents a novel LUT-based vision transformer architecture optimized for edge devices, with an FPGA accelerator, improving efficiency while maintaining accuracy.
Findings
Achieves 95.5% accuracy on CIFAR-10
Reduces model weights by over 60%
Delivers 1.9x higher energy efficiency and 1.3x lower latency
Abstract
Vision Transformers have been tremendously successful in computer vision tasks. However, their large computational, memory, and energy demands are a challenge for edge inference on FPGAs, a field that has seen a recent surge in demand. We recognize the benefits of recent works on logic and Look Up Table (LUT) based networks, such as LogicNets, NeuraLUT, and DWN, among others, in offering models that simultaneously reduce both the memory and compute footprints. However, these models do not natively perform well on common vision tasks, such as CIFAR-10/100. In this work, we propose LL-ViT, a novel edge-optimized vision transformer design that integrates layers of LUT neurons within the transformer architecture. Based on our characterization, which reveals that a majority of model weights and computations come from the channel mixer (MLP layer), we design an alternate LUT-based channel mixer,…
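To make the two ideas in the abstract concrete, here is a minimal sketch in plain Python. It is not the authors' implementation. The first function counts weights in a textbook transformer encoder block, assuming the common MLP expansion ratio of 4 and ignoring biases and normalization layers, to show why the channel mixer tends to dominate. The `LutNeuron` class and the `random_lut_layer` helper are hypothetical names that illustrate inference through a k-input lookup table over binary inputs. How LL-ViT trains its tables and encodes activations into bits is not covered here.

```python
import random


def mlp_weight_share(dim, mlp_ratio=4):
    """Fraction of one encoder block's weight matrices held by the MLP.

    Assumes a standard block: four dim x dim attention projections
    (Q, K, V, output) and a two-layer MLP with hidden size mlp_ratio * dim.
    Biases and normalization parameters are ignored.
    """
    attention_weights = 4 * dim * dim
    mlp_weights = 2 * mlp_ratio * dim * dim
    return mlp_weights / (attention_weights + mlp_weights)


class LutNeuron:
    """Hypothetical k-input LUT neuron: a truth table addressed by k input bits."""

    def __init__(self, input_ids, table):
        if len(table) != 2 ** len(input_ids):
            raise ValueError("table needs 2**k entries for k inputs")
        self.input_ids = list(input_ids)  # which input bits this neuron reads
        self.table = list(table)          # one stored output bit per address

    def __call__(self, bits):
        # Pack the selected bits into an address; the first input is the LSB.
        address = 0
        for position, index in enumerate(self.input_ids):
            address |= bits[index] << position
        return self.table[address]


def random_lut_layer(n_inputs, n_neurons, k=6, seed=0):
    """Build a layer of randomly wired LUT neurons with random tables (illustration only)."""
    rng = random.Random(seed)
    return [
        LutNeuron(rng.sample(range(n_inputs), k),
                  [rng.randint(0, 1) for _ in range(2 ** k)])
        for _ in range(n_neurons)
    ]


if __name__ == "__main__":
    # With ratio 4 the MLP holds 8 of every 12 dim*dim weight blocks.
    print(mlp_weight_share(192))

    # A 2-input LUT whose table encodes XOR of input bits 0 and 2.
    xor = LutNeuron([0, 2], [0, 1, 1, 0])
    print(xor([1, 0, 0]), xor([1, 0, 1]))

    layer = random_lut_layer(n_inputs=16, n_neurons=8, k=4)
    bits = [1, 0] * 8
    print([neuron(bits) for neuron in layer])
```

Under these assumptions a LUT neuron performs no multiplications at inference time: evaluating it is one table read, and each k-input neuron stores 2^k bits regardless of what function it encodes.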
