Activator: GLU Activation Function as the Core Component of a Vision Transformer
Abdullah Nazhat Abdullah, Tarkan Aydin

TL;DR
This paper proposes replacing the attention mechanism and MLP usually adopted in vision transformers with a GLU-based architecture, aiming to reduce computational cost while maintaining competitive performance.
Contribution
The paper introduces a transformer architecture built around the GLU activation function, offering a more efficient alternative to standard attention-based models for vision tasks.
Findings
The GLU-based architecture reduces computational complexity.
It achieves competitive performance compared with baseline models.
The results support more efficient vision transformer designs.
Abstract
The transformer architecture has driven many successes in a variety of tasks within the field of deep learning, in particular the recent advances in natural language processing (NLP) culminating in large language models (LLMs). Adding to that success, the transformer architecture has attracted widespread interest from computer vision (CV) researchers and practitioners, allowing for many advancements in vision-related tasks and opening the door for multitask and multi-modal deep learning architectures that share the same principle of operation. One drawback of these architectures is their reliance on the scaled dot product attention mechanism with the softmax activation function, which is computationally expensive and requires large compute capabilities for both training and inference. This paper investigates substituting the MLP and attention mechanism usually adopted for transformer…
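To make the contrast in the abstract concrete, the sketch below places generic scaled dot product attention with softmax next to a generic gated linear unit (GLU). It is a minimal illustration under stated assumptions, not the paper's architecture: the truncated abstract does not specify how the GLU is arranged inside the model, so the function names, the weight names `w_value` and `w_gate`, the omission of bias terms, and the toy inputs are all hypothetical. The standard GLU form assumed here is (x W_value) multiplied elementwise by sigmoid(x W_gate).

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def scaled_dot_product_attention(q, k, v):
    """Generic attention: softmax(Q K^T / sqrt(d)) V.

    The score matrix has one entry per pair of tokens, so its size
    grows quadratically with the number of tokens.
    """
    d = len(q[0])
    k_transposed = [list(col) for col in zip(*k)]
    scores = matmul(q, k_transposed)  # shape: tokens x tokens
    weights = [softmax([s / math.sqrt(d) for s in row]) for row in scores]
    return matmul(weights, v)


def glu(x, w_value, w_gate):
    """Generic GLU (assumed form): (x W_value) * sigmoid(x W_gate).

    Each token is processed independently, so no tokens x tokens
    matrix is formed. Bias terms are omitted for brevity.
    """
    values = matmul(x, w_value)
    gates = matmul(x, w_gate)
    return [[a * sigmoid(g) for a, g in zip(value_row, gate_row)]
            for value_row, gate_row in zip(values, gates)]


# Hypothetical toy input: 3 tokens with 2 features each.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]

attended = scaled_dot_product_attention(tokens, tokens, tokens)
gated = glu(tokens, identity, zeros)  # zero gate weights give sigmoid(0) = 0.5
```

With an all-zero gate matrix every gate equals 0.5, so `gated` is simply the input halved. The attention call instead mixes information across all three tokens through a 3 x 3 weight matrix, and that pairwise matrix is the part whose cost grows with the square of the token count.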
Taxonomy
Topics: CCD and CMOS Imaging Sensors
Methods: Softmax
