MABViT -- Modified Attention Block Enhances Vision Transformers

Mahesh Ramesh; Aswinkumar Ramkumar

arXiv:2312.01324 · cs.CV · January 2, 2024 · 1 cites

TL;DR

This paper introduces MABViT, a novel vision transformer variant that incorporates non-linearity within the attention block, leading to improved accuracy and efficiency on ImageNet-1K, especially in deep architectures.

Contribution

The paper proposes a transformer architecture that integrates non-linearity into the attention block and surpasses the state-of-the-art S/16 Vision Transformer on ImageNet-1K while using fewer parameters.

Findings

01. Surpasses the S/16 Vision Transformer by 0.6% on ImageNet-1K.

02. Outperforms the B/16 variant while using half the parameters.

03. Deep MABViT variants show greater potential than standard architectures.

Abstract

Recent studies have demonstrated the effectiveness of Gated Linear Units (GLU) in enhancing transformer models, particularly Large Language Models (LLMs). In addition, running the components of each Transformer block in parallel, rather than in the conventional serial order, has been shown to accelerate LLM training without significantly impacting performance. However, when we ran the MLP and attention block in parallel for image classification, we observed a noticeable decline in performance. To tackle this problem, we propose a novel transformer variant that integrates non-linearity within the attention block. We apply a GLU-based activation function to the Value tensor, and this new technique surpasses the current state-of-the-art S/16 variant of Vision Transformers by 0.6% on the ImageNet-1K dataset while utilizing fewer parameters. It also…
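The two ideas in the abstract, a GLU-style gate on the Value tensor and a parallel arrangement of attention and MLP, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. It assumes a single attention head, the original sigmoid form of GLU, a ReLU MLP, and a layer norm without learned scale or bias; every function and weight name (`glu_value_attention`, `w_gate`, `serial_block`, `parallel_block`, and so on) is hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def layer_norm(x, eps=1e-6):
    # Simplified layer norm: no learned scale or bias (assumption).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def glu_value_attention(x, w_q, w_k, w_v, w_gate, w_o):
    """Single-head self-attention whose Value tensor is gated GLU-style."""
    q, k = x @ w_q, x @ w_k
    # Assumption: original GLU form, a linear branch times a sigmoid gate.
    v = (x @ w_v) * sigmoid(x @ w_gate)
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))
    return weights @ v @ w_o

def mlp(x, w_1, w_2):
    return np.maximum(x @ w_1, 0.0) @ w_2  # two-layer ReLU MLP

def serial_block(x, attn, ff):
    # Conventional order: attention first, then MLP on the updated stream.
    x = x + attn(layer_norm(x))
    return x + ff(layer_norm(x))

def parallel_block(x, attn, ff):
    # Parallel order: attention and MLP both read the same normalized input.
    h = layer_norm(x)
    return x + attn(h) + ff(h)

rng = np.random.default_rng(0)
n_tokens, d = 5, 8  # toy sizes, not taken from the paper
x = rng.normal(size=(n_tokens, d))
w_q, w_k, w_v, w_gate, w_o = (rng.normal(size=(d, d)) * 0.1 for _ in range(5))
w_1 = rng.normal(size=(d, 4 * d)) * 0.1
w_2 = rng.normal(size=(4 * d, d)) * 0.1

attn = lambda h: glu_value_attention(h, w_q, w_k, w_v, w_gate, w_o)
ff = lambda h: mlp(h, w_1, w_2)

out_serial = serial_block(x, attn, ff)
out_parallel = parallel_block(x, attn, ff)
print(out_serial.shape, out_parallel.shape)
```

In this sketch the gate is the only non-linearity applied to the Value path inside attention, and the two block functions differ only in whether the MLP reads the stream before or after the attention update. Which gate activation and block layout the paper actually uses should be checked against the full text.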

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Multimodal Machine Learning Applications

Methods: Multi-Head Attention · Linear Layer · Attention Is All You Need · Absolute Position Encodings · Dropout · Dense Connections · Byte Pair Encoding · Softmax · Layer Normalization · Position-Wise Feed-Forward Layer