FViT: A Focal Vision Transformer with Gabor Filter
Yulong Shi, Mingwei Sun, Yongshuai Wang, Zengqiang Chen

TL;DR
FViT introduces a novel vision transformer architecture that integrates learnable Gabor filters and biologically inspired modules to improve efficiency and performance in dense vision tasks.
Contribution
This paper proposes FViT, a new vision transformer framework combining Gabor filters and neuroscience-inspired blocks for better feature focus and reduced complexity.
Findings
FViT outperforms existing models in multiple vision tasks.
FViT demonstrates higher computational efficiency and scalability.
The biologically inspired design improves feature discrimination across scales.
Abstract
Vision transformers have achieved encouraging progress in various computer vision tasks. A common belief is that this is attributed to the capability of self-attention in modeling the global dependencies among feature tokens. However, self-attention still faces several challenges in dense prediction tasks, including high computational complexity and absence of desirable inductive bias. To alleviate these issues, the potential advantages of combining vision transformers with Gabor filters are revisited, and a learnable Gabor filter (LGF) using convolution is proposed. The LGF does not rely on self-attention, and it is used to simulate the response of fundamental cells in the biological visual system to the input images. This encourages vision transformers to focus on discriminative feature representations of targets across different scales and orientations. In addition, a Bionic Focal…
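To make the Gabor-filter idea in the abstract concrete, here is a minimal NumPy sketch of a fixed (not learned) Gabor filter bank applied by convolution. It uses the standard textbook Gabor function, not the paper's LGF: the function names (`gabor_kernel`, `gabor_bank`, `conv2d_valid`) and every parameter value are illustrative assumptions. In a learnable variant, parameters such as `sigma`, `theta`, and `lambd` would presumably be trainable rather than fixed constants.

```python
import numpy as np

def gabor_kernel(size, sigma, theta, lambd, gamma=0.5, psi=0.0):
    """Real part of a standard 2D Gabor function on a size x size grid (size odd).

    All parameter values here are illustrative assumptions, not the paper's.
    """
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1].astype(float)
    # Rotate the coordinate frame by the orientation theta.
    x_r = x * np.cos(theta) + y * np.sin(theta)
    y_r = -x * np.sin(theta) + y * np.cos(theta)
    # Gaussian envelope (scale sigma, aspect ratio gamma) times a cosine carrier.
    envelope = np.exp(-(x_r ** 2 + (gamma * y_r) ** 2) / (2.0 * sigma ** 2))
    carrier = np.cos(2.0 * np.pi * x_r / lambd + psi)
    return envelope * carrier

def gabor_bank(size=7, sigmas=(2.0,), n_orient=4, lambd=4.0):
    """Stack kernels over several scales and evenly spaced orientations."""
    thetas = [k * np.pi / n_orient for k in range(n_orient)]
    return np.stack([gabor_kernel(size, s, t, lambd)
                     for s in sigmas for t in thetas])

def conv2d_valid(image, kernel):
    """'Valid' 2D cross-correlation, as convolution layers compute it."""
    windows = np.lib.stride_tricks.sliding_window_view(image, kernel.shape)
    return np.einsum("ijkl,kl->ij", windows, kernel)

# Usage: vertical stripes with period 4 excite the theta = 0 filter most.
stripes = np.tile(np.cos(2.0 * np.pi * np.arange(16) / 4.0), (16, 1))
bank = gabor_bank()
responses = np.stack([conv2d_valid(stripes, k) for k in bank])
```

Each kernel in the bank responds most strongly to structure matching its own scale and orientation, which is the property the abstract appeals to when it describes focusing on features "across different scales and orientations".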
Taxonomy
Topics: CCD and CMOS Imaging Sensors · Infrared Target Detection Methodologies
Methods: Attention Is All You Need · Convolution · Linear Layer · Softmax · Multi-Head Attention · Layer Normalization · Residual Connection · Dense Connections · Focus · Vision Transformer
