FoveaTer: Foveated Transformer for Image Classification
Aditya Jonnalagadda, William Yang Wang, B. S. Manjunath, Miguel P. Eckstein

TL;DR
FoveaTer introduces a foveated vision transformer that mimics biological eye movements and peripheral vision, improving scene classification efficiency and robustness against adversarial attacks.
Contribution
This paper presents the first foveated transformer architecture that incorporates eye movement-inspired pooling and attention mechanisms for image classification.
Findings
FoveaTer outperforms baseline models in scene categorization tasks.
The model better explains human decision-making in visual tasks.
FoveaTer shows increased robustness to adversarial attacks.
Abstract
Many animals and humans process the visual field with a varying spatial resolution (foveated vision) and use peripheral processing to make eye movements and point the fovea to acquire high-resolution information about objects of interest. This architecture results in computationally efficient rapid scene exploration. Recent progress in self-attention-based Vision Transformers offers an alternative to the traditionally convolution-reliant computer vision systems. However, the Transformer models do not explicitly model the foveated properties of the visual system nor the interaction between eye movements and the classification task. We propose the Foveated Transformer (FoveaTer) model, which uses pooling regions and eye movements to perform object classification tasks using a Vision Transformer architecture. Using square pooling regions or biologically-inspired radial-polar pooling regions, our…
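As a rough illustration of the square-pooling idea mentioned in the abstract, the sketch below keeps features at full resolution near a fixation point and averages them in blocks elsewhere. It is not the authors' implementation: the function name `foveated_pool`, the parameters `fovea_radius` and `block`, the use of scalar features, and the rule for deciding which blocks count as foveal are all assumptions made for this example.

```python
def foveated_pool(grid, fixation, fovea_radius=1, block=2):
    """Pool a 2D grid of scalar features around a fixation point.

    Hypothetical sketch: blocks that touch the fovea keep every cell
    as its own token; peripheral blocks collapse to a single mean token.
    """
    rows, cols = len(grid), len(grid[0])
    fix_r, fix_c = fixation
    tokens = []
    for r0 in range(0, rows, block):
        for c0 in range(0, cols, block):
            # Coordinates of the cells in this square pooling region.
            cells = [(r, c)
                     for r in range(r0, min(r0 + block, rows))
                     for c in range(c0, min(c0 + block, cols))]
            # Chebyshev distance gives a square-shaped fovea (an assumption).
            in_fovea = any(max(abs(r - fix_r), abs(c - fix_c)) <= fovea_radius
                           for r, c in cells)
            if in_fovea:
                tokens.extend(grid[r][c] for r, c in cells)
            else:
                tokens.append(sum(grid[r][c] for r, c in cells) / len(cells))
    return tokens


# A 4x4 feature map with values 0..15, fixated at the top-left corner.
grid = [[4 * r + c for c in range(4)] for r in range(4)]
pooled = foveated_pool(grid, fixation=(0, 0), fovea_radius=0)
```

Here the 16 input cells reduce to 7 tokens: four at full resolution from the fixated block and one mean per peripheral block. Moving `fixation` changes which region is seen in detail, which is the role eye movements play in the description above.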
Taxonomy
Topics: Adversarial Robustness in Machine Learning · Domain Adaptation and Few-Shot Learning · Image Processing Techniques and Applications
Methods: Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Vision Transformer · Label Smoothing · Layer Normalization · Byte Pair Encoding · Residual Connection · Dropout
