SAG-ViT: A Scale-Aware, High-Fidelity Patching Approach with Graph Attention for Vision Transformers
Shravan Venkatraman, Jaskaran Singh Walia, Joe Dhanith P R

TL;DR
SAG-ViT introduces a novel scale-aware graph attention mechanism that enhances vision transformers by integrating multi-scale features and spatial hierarchies for improved image classification.
Contribution
It presents a new architecture combining CNN-based multi-scale features, graph attention, and transformers to better capture spatial hierarchies in images.
Findings
Outperforms existing ViT models on benchmark datasets
Effectively captures multi-scale and long-range dependencies
Enhances image classification accuracy
Abstract
Vision Transformers (ViTs) have redefined image classification by leveraging self-attention to capture complex patterns and long-range dependencies between image patches. However, a key challenge for ViTs is efficiently incorporating multi-scale feature representations, which are inherent in convolutional neural networks (CNNs) through their hierarchical structure. Graph transformers have made strides in addressing this by leveraging graph-based modeling, but they often lose or insufficiently represent spatial hierarchies, especially since redundant or less relevant areas dilute the image's contextual representation. To bridge this gap, we propose SAG-ViT, a Scale-Aware Graph Attention ViT that integrates the multi-scale feature capabilities of CNNs, the representational power of ViTs, and graph-attended patching to enable richer contextual representation. Using EfficientNetV2 as a backbone, the…
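The abstract is cut off before it describes the pipeline in detail, so the following is only a rough illustration of what "graph-attended patching" on CNN feature maps could look like, not the authors' implementation. It assumes a random tensor in place of EfficientNetV2 features, non-overlapping 2 x 2 patches, an 8-neighbour grid graph with self-loops, and a single-head GAT-style layer with the learned projection omitted. All function names and parameters (`extract_patches`, `grid_neighbors`, `graph_attention`, `a_src`, `a_dst`) are hypothetical.

```python
import math
import random


def extract_patches(fmap, k):
    """Split an H x W x C feature map (nested lists) into non-overlapping
    k x k patches; each patch is flattened into one node feature vector."""
    rows, cols = len(fmap) // k, len(fmap[0]) // k
    nodes = []
    for pi in range(rows):
        for pj in range(cols):
            vec = [v
                   for di in range(k)
                   for dj in range(k)
                   for v in fmap[pi * k + di][pj * k + dj]]
            nodes.append(vec)
    return nodes, rows, cols


def grid_neighbors(rows, cols):
    """Link each patch to itself and to its (up to 8) spatial neighbours.
    The 8-neighbour choice is an assumption made for this sketch."""
    nbrs = []
    for i in range(rows):
        for j in range(cols):
            nbrs.append([(i + di) * cols + (j + dj)
                         for di in (-1, 0, 1)
                         for dj in (-1, 0, 1)
                         if 0 <= i + di < rows and 0 <= j + dj < cols])
    return nbrs


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def graph_attention(nodes, nbrs, a_src, a_dst):
    """Single-head GAT-style aggregation (learned projection W omitted):
    e_ij = LeakyReLU(a_src . h_i + a_dst . h_j), softmax over neighbours of i."""
    out, weights = [], []
    for i, h_i in enumerate(nodes):
        scores = [leaky_relu(dot(a_src, h_i) + dot(a_dst, nodes[j]))
                  for j in nbrs[i]]
        top = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha = [e / total for e in exps]
        out.append([sum(a * nodes[j][d] for a, j in zip(alpha, nbrs[i]))
                    for d in range(len(h_i))])
        weights.append(alpha)
    return out, weights


random.seed(0)
H, W, C, k = 8, 8, 4, 2  # toy sizes, not taken from the paper
# Stand-in for a backbone feature map; a real model would use CNN activations.
fmap = [[[random.gauss(0, 1) for _ in range(C)] for _ in range(W)]
        for _ in range(H)]

nodes, rows, cols = extract_patches(fmap, k)
nbrs = grid_neighbors(rows, cols)
dim = k * k * C
a_src = [random.gauss(0, 0.1) for _ in range(dim)]  # untrained attention vectors
a_dst = [random.gauss(0, 0.1) for _ in range(dim)]
tokens, alphas = graph_attention(nodes, nbrs, a_src, a_dst)
```

In a full model, the resulting `tokens` would presumably be passed to a transformer encoder as the patch sequence; that stage, and any training of the attention vectors, is left out here.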
Taxonomy
Topics: Graph Theory and Algorithms · Visual Attention and Saliency Detection · Advanced Neural Network Applications
Methods: Attention Is All You Need · EfficientNetV2 · Activation Patching · Sigmoid Activation · Depthwise Convolution · Convolution · Squeeze-and-Excitation Block · Absolute Position Encodings
