Scattering Vision Transformer: Spectral Mixing Matters
Badri N. Patro, Vijay Srinivas Agneeswaran

TL;DR
The paper introduces the Scattering Vision Transformer (SVT), a novel model that captures detailed image information efficiently using spectral methods, achieving state-of-the-art results with reduced complexity.
Contribution
SVT incorporates spectral scattering and gating mechanisms to improve detail capture and reduce computational complexity in vision transformers.
Findings
SVT achieves state-of-the-art accuracy on ImageNet.
SVT reduces the number of parameters and FLOPs compared to previous models.
SVT performs well in transfer learning and other vision tasks.
Abstract
Vision transformers have gained significant attention and achieved state-of-the-art performance in various computer vision tasks, including image classification, instance segmentation, and object detection. However, challenges remain in addressing attention complexity and effectively capturing fine-grained information within images. Existing solutions often resort to down-sampling operations, such as pooling, to reduce computational cost. Unfortunately, such operations are non-invertible and can result in information loss. In this paper, we present a novel approach called Scattering Vision Transformer (SVT) to tackle these challenges. SVT incorporates a spectrally scattering network that enables the capture of intricate image details. SVT overcomes the invertibility issue associated with down-sampling operations by separating low-frequency and high-frequency components. Furthermore, SVT…
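The invertibility argument in the abstract can be illustrated with a small sketch. It is not the paper's scattering network: it uses a one-level 1D Haar wavelet split as a simplified stand-in, and the function names (`haar_split`, `haar_merge`, `avg_pool`) and the sample signals are hypothetical. The point is only that keeping both the low-frequency and high-frequency halves allows exact reconstruction, while average pooling maps different inputs to the same output.

```python
import numpy as np


def haar_split(x):
    """One-level 1D Haar split of an even-length signal.

    Returns (low, high): the low-frequency (scaled pairwise sums) and
    high-frequency (scaled pairwise differences) halves.
    """
    even, odd = x[0::2], x[1::2]
    low = (even + odd) / np.sqrt(2.0)
    high = (even - odd) / np.sqrt(2.0)
    return low, high


def haar_merge(low, high):
    """Inverse of haar_split: rebuild the signal from both halves."""
    x = np.empty(low.size * 2, dtype=float)
    x[0::2] = (low + high) / np.sqrt(2.0)
    x[1::2] = (low - high) / np.sqrt(2.0)
    return x


def avg_pool(x):
    """Stride-2 average pooling: keeps only the pairwise means."""
    return (x[0::2] + x[1::2]) / 2.0


# Two different signals (illustrative values only).
signal = np.array([4.0, 2.0, 5.0, 5.0, 1.0, 7.0, 3.0, 3.0])
other = np.array([3.0, 3.0, 5.0, 5.0, 4.0, 4.0, 3.0, 3.0])

# Keeping both frequency bands: the original is recovered exactly.
low, high = haar_split(signal)
restored = haar_merge(low, high)
print("split/merge recovers input:", np.allclose(restored, signal))

# Pooling: both signals collapse to the same output, so the
# operation cannot be inverted.
print("pooling collides:", np.allclose(avg_pool(signal), avg_pool(other)))
```

In this sketch the high-frequency half carries exactly the pairwise differences that pooling discards, which is why retaining it makes the split reversible.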
Taxonomy
Topics: Advanced Neural Network Applications · Brain Tumor Detection and Classification · Remote-Sensing Image Classification
Methods: Attention Is All You Need · Dense Connections · Dropout · Byte Pair Encoding · Softmax · Layer Normalization · Linear Layer · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Label Smoothing
