EViT: An Eagle Vision Transformer with Bi-Fovea Self-Attention
Yulong Shi, Mingwei Sun, Yongshuai Wang, Jiahao Ma, Zengqiang Chen

TL;DR
EViT introduces a biologically inspired vision transformer architecture that mimics eagle eye features, enhancing performance and efficiency in vision tasks through novel self-attention and hierarchical processing mechanisms.
Contribution
The paper proposes a new eagle-inspired vision transformer architecture with Bi-Fovea self-attention and hierarchical processing, improving accuracy and computational efficiency.
Findings
EViT achieves competitive results in image classification, object detection, and segmentation.
EViT demonstrates superior performance and efficiency compared to existing models.
The proposed architecture effectively mimics biological visual processing.
Abstract
Owing to advancements in deep learning technology, Vision Transformers (ViTs) have demonstrated impressive performance in various computer vision tasks. Nonetheless, ViTs still face some challenges, such as high computational complexity and the absence of desirable inductive biases. To alleviate these issues, the potential advantages of combining eagle vision with ViTs are explored. We summarize a Bi-Fovea Visual Interaction (BFVI) structure inspired by the unique physiological and visual characteristics of eagle eyes. A novel Bi-Fovea Self-Attention (BFSA) mechanism and Bi-Fovea Feedforward Network (BFFN) are proposed based on this structural design approach, which can be used to mimic the hierarchical and parallel information processing scheme of the biological visual cortex, enabling networks to learn feature representations of targets in a coarse-to-fine manner. Furthermore, a…
Taxonomy
Topics: Remote-Sensing Image Classification · Currency Recognition and Detection · Visual Attention and Saliency Detection
Methods: Convolution · Feedforward Network · Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Linear Layer · Label Smoothing · Adam · Absolute Position Encodings
