Focus Your Attention: Towards Data-Intuitive Lightweight Vision Transformers
Suyash Gaurav, Muhammad Farhan Humayun, Jukka Heikkonen, Jatin Chaudhary

TL;DR
This paper introduces a lightweight, energy-efficient vision transformer architecture using novel patch pooling and attention modules, achieving comparable accuracy to state-of-the-art models while significantly reducing computational costs.
Contribution
The paper proposes the Super-Pixel Based Patch Pooling and Light Latent Attention modules, enabling efficient, task-specific vision transformers with reduced complexity and improved training speed.
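The idea of pooling patches by superpixel can be illustrated with a minimal sketch. It assumes each patch has already been assigned to a superpixel by some segmentation step, and it uses plain mean pooling; the paper's actual SPPP procedure is not described on this page, so the function name `pool_patches_by_superpixel`, the label layout, and the choice of mean pooling are all illustrative assumptions rather than the authors' method.

```python
from collections import defaultdict


def pool_patches_by_superpixel(patch_embeddings, superpixel_labels):
    """Mean-pool patch embeddings that share a superpixel label.

    patch_embeddings: list of equal-length float vectors, one per patch.
    superpixel_labels: list of ints, the (assumed) superpixel id of each patch.
    Returns one pooled vector per superpixel, ordered by label.
    """
    if len(patch_embeddings) != len(superpixel_labels):
        raise ValueError("need exactly one label per patch")

    # Group the patch vectors by the superpixel they belong to.
    groups = defaultdict(list)
    for vec, label in zip(patch_embeddings, superpixel_labels):
        groups[label].append(vec)

    # Average each group coordinate by coordinate.
    pooled = []
    for label in sorted(groups):
        members = groups[label]
        dim = len(members[0])
        pooled.append(
            [sum(v[i] for v in members) / len(members) for i in range(dim)]
        )
    return pooled


# Four patches, two superpixels: the token count drops from 4 to 2.
patches = [[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [2.0, 2.0]]
labels = [0, 0, 1, 0]
tokens = pool_patches_by_superpixel(patches, labels)
```

Because the number of pooled tokens equals the number of superpixels rather than the number of patches, any attention layer applied afterwards operates on a shorter sequence.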
Findings
Significantly reduces computational complexity and energy consumption.
Achieves accuracy comparable to state-of-the-art models.
Improves training efficiency and convergence speed.
Abstract
The evolution of Vision Transformers has led to their widespread adaptation to different domains. Despite large-scale success, there remain significant challenges including their reliance on extensive computational and memory resources for pre-training on huge datasets as well as difficulties in task-specific transfer learning. These limitations coupled with energy inefficiencies mainly arise due to the computation-intensive self-attention mechanism. To address these issues, we propose a novel Super-Pixel Based Patch Pooling (SPPP) technique that generates context-aware, semantically rich, patch embeddings to effectively reduce the architectural complexity and improve efficiency. Additionally, we introduce the Light Latent Attention (LLA) module in our pipeline by integrating latent tokens into the attention mechanism allowing cross-attention operations to significantly reduce the time…
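The abstract is cut off before it details the LLA module, so the following is only a rough sketch of the general latent-token pattern it alludes to: a small set of M latent tokens first reads from the N input tokens by cross-attention, and the input tokens then read back from the latents. This replaces the N x N score matrix of self-attention with two N x M ones. The function names, the absence of learned projections, and the two-step read/write layout are assumptions made for illustration, not the authors' design.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(r) for r in zip(*a)]


def softmax_rows(scores):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in scores:
        m = max(row)
        exps = [math.exp(s - m) for s in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def cross_attention(queries, keys_values):
    """Scaled dot-product attention; keys and values are the same matrix here.

    Learned Q/K/V projections are omitted (an assumption made for brevity).
    """
    d = len(queries[0])
    scores = matmul(queries, transpose(keys_values))
    scaled = [[s / math.sqrt(d) for s in row] for row in scores]
    return matmul(softmax_rows(scaled), keys_values)


def latent_attention(tokens, latents):
    """Route attention through M latents instead of all N x N token pairs."""
    summary = cross_attention(latents, tokens)  # M x d: latents read from tokens
    return cross_attention(tokens, summary)     # N x d: tokens read from latents


# N = 6 tokens of width d = 3, M = 2 hypothetical latent tokens.
tokens = [
    [1.0, 0.0, 2.0], [0.0, 1.0, 0.0], [2.0, 2.0, 1.0],
    [0.5, 0.5, 0.5], [1.5, 0.0, 1.0], [0.0, 3.0, 1.0],
]
latents = [[0.1, 0.2, 0.3], [0.3, 0.1, 0.0]]
out = latent_attention(tokens, latents)
```

In this sketch the two attention steps compute 2 * N * M scores (24 for the example) instead of the N * N (36) that full self-attention would need; the gap widens as N grows while M stays fixed.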
