MABViT -- Modified Attention Block Enhances Vision Transformers
Mahesh Ramesh, Aswinkumar Ramkumar

TL;DR
This paper introduces MABViT, a Vision Transformer variant that adds non-linearity inside the attention block by applying a GLU-based activation to the Value tensor. It improves accuracy on ImageNet-1K while using fewer parameters, especially in deep architectures.
Contribution
The paper proposes a new transformer architecture with non-linearity integrated into the attention block, outperforming the state-of-the-art S/16 Vision Transformer on ImageNet-1K while using fewer parameters.
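To make the idea of non-linearity inside the attention block concrete, here is a minimal single-head sketch in plain Python. It assumes the original sigmoid-gated GLU applied to the value projection; the abstract only says "GLU-based activation function on the Value tensor", so the gate choice, the extra weight matrix `w_gate`, and all function names here are illustrative assumptions rather than the authors' implementation.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def glu_value_attention(x, w_q, w_k, w_v, w_gate):
    """Single-head self-attention whose value tensor passes through a GLU.

    x has shape (tokens, dim); every weight has shape (dim, dim).
    ASSUMPTION: sigmoid gate (original GLU); the paper may use another variant.
    """
    q, k = matmul(x, w_q), matmul(x, w_k)
    lin, gate = matmul(x, w_v), matmul(x, w_gate)
    # GLU: element-wise product of a linear projection and a gated projection.
    v = [[a * sigmoid(g) for a, g in zip(lr, gr)] for lr, gr in zip(lin, gate)]
    scale = 1.0 / math.sqrt(len(w_q[0]))
    k_t = [list(col) for col in zip(*k)]  # transpose of k
    scores = matmul(q, k_t)
    weights = [softmax([s * scale for s in row]) for row in scores]
    return matmul(weights, v), weights

def random_matrix(rows, cols, rng):
    """Hypothetical helper: small random weights for the demo."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
tokens, dim = 4, 8
x = random_matrix(tokens, dim, rng)
w_q, w_k, w_v, w_gate = (random_matrix(dim, dim, rng) for _ in range(4))
out, attn = glu_value_attention(x, w_q, w_k, w_v, w_gate)
print(len(out), len(out[0]))  # output keeps the (tokens, dim) shape
```

Compared with standard attention, the only change in this sketch is the gated value path; queries, keys, and the softmax weighting are untouched.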
Findings
Surpasses the S/16 Vision Transformer by 0.6% on ImageNet-1K while using fewer parameters.
Outperforms the B/16 variant with half the parameters.
Deep MABViT variants show greater potential than standard architectures.
Abstract
Recent studies have demonstrated the effectiveness of Gated Linear Units (GLU) in enhancing transformer models, particularly in Large Language Models (LLMs). Additionally, utilizing a parallel configuration within each Transformer block rather than the conventional serialized method has been revealed to accelerate the training of LLMs without significantly impacting performance. However, when the MLP and attention block were run in parallel for the image classification task, we observed a noticeable decline in performance. We propose a novel transformer variant that integrates non-linearity within the attention block to tackle this problem. We implemented the GLU-based activation function on the Value tensor, and this new technique surpasses the current state-of-the-art S/16 variant of Vision Transformers by 0.6% on the ImageNet-1K dataset while utilizing fewer parameters. It also…
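The serialized and parallel block configurations contrasted in the abstract can be written out as follows. This is a sketch assuming the common pre-LayerNorm formulation; the symbols x, y, z, LN, Attn, and MLP are notation introduced here for illustration and are not taken from the paper.

```latex
% Serialized (conventional) block: the MLP consumes the attention output.
\begin{aligned}
y &= x + \mathrm{Attn}(\mathrm{LN}(x)) \\
z &= y + \mathrm{MLP}(\mathrm{LN}(y))
\end{aligned}

% Parallel block: attention and MLP both read the same normalized input.
z = x + \mathrm{Attn}(\mathrm{LN}(x)) + \mathrm{MLP}(\mathrm{LN}(x))
```

In the parallel form the MLP does not see the attention output within the same block. One plausible reading of the abstract is that placing a non-linearity inside the attention block is meant to compensate for this.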
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications
Methods: Multi-Head Attention · Linear Layer · Attention Is All You Need · Absolute Position Encodings · Dropout · Dense Connections · Byte Pair Encoding · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer
