EViT: An Eagle Vision Transformer with Bi-Fovea Self-Attention

Yulong Shi; Mingwei Sun; Yongshuai Wang; Jiahao Ma; Zengqiang Chen

arXiv:2310.06629 · cs.CV · February 11, 2025 · 2 cites

TL;DR

EViT introduces a biologically inspired vision transformer architecture that mimics eagle eye features, enhancing performance and efficiency in vision tasks through novel self-attention and hierarchical processing mechanisms.

Contribution

The paper proposes a new eagle-inspired vision transformer architecture with Bi-Fovea self-attention and hierarchical processing, improving accuracy and computational efficiency.

Findings

01  EViT achieves competitive results in image classification, object detection, and segmentation.

02  EViT demonstrates superior performance and efficiency compared to existing models.

03  The proposed architecture effectively mimics biological visual processing.

Abstract

Owing to advancements in deep learning technology, Vision Transformers (ViTs) have demonstrated impressive performance in various computer vision tasks. Nonetheless, ViTs still face some challenges, such as high computational complexity and the absence of desirable inductive biases. To alleviate these issues, the potential advantages of combining eagle vision with ViTs are explored. We summarize a Bi-Fovea Visual Interaction (BFVI) structure inspired by the unique physiological and visual characteristics of eagle eyes. A novel Bi-Fovea Self-Attention (BFSA) mechanism and Bi-Fovea Feedforward Network (BFFN) are proposed based on this structural design approach, which can be used to mimic the hierarchical and parallel information processing scheme of the biological visual cortex, enabling networks to learn feature representations of targets in a coarse-to-fine manner. Furthermore, a…
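
The coarse-to-fine idea mentioned in the abstract can be illustrated with a small sketch. This is not the paper's BFSA: the abstract is truncated and gives no equations, so the code below is a generic two-stage attention written under stated assumptions. The function name `coarse_to_fine_attention`, the `pool` parameter, the use of average pooling, and the omission of learned projections and multiple heads are all hypothetical choices made for illustration. For the actual design, see the official repository listed under Code & Models.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # Single-head scaled dot-product attention, no learned projections.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def coarse_to_fine_attention(tokens, pool=4):
    """Hypothetical two-stage attention (an assumption, not the paper's BFSA).

    tokens: array of shape (n, d); n must be divisible by pool.
    """
    n, d = tokens.shape
    assert n % pool == 0, "token count must be divisible by pool"
    # Coarse stage: average-pool groups of `pool` tokens into a short summary,
    # then let every full-resolution token attend to that summary.
    coarse = tokens.reshape(n // pool, pool, d).mean(axis=1)
    context = attention(tokens, coarse, coarse)
    # Fine stage: tokens enriched with coarse context attend at full resolution.
    fine_in = tokens + context
    return attention(fine_in, fine_in, fine_in)

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 8))   # 16 tokens, 8 channels
out = coarse_to_fine_attention(x, pool=4)
print(out.shape)  # (16, 8)
```

In this sketch the coarse stage scores each token against only n / pool summary tokens, which is where a design of this kind would save computation. The fine stage here is still full attention, so the sketch shows only the ordering of the two stages, not any efficiency result claimed by the paper.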

Code & Models

Repositories

nkusyl/evit (PyTorch, official)

Taxonomy

Topics: Remote-Sensing Image Classification · Currency Recognition and Detection · Visual Attention and Saliency Detection

Methods: Convolution · Feedforward Network · Multi-Head Attention · Attention Is All You Need · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Linear Layer · Label Smoothing · Adam · Absolute Position Encodings