Brain-Inspired Stepwise Patch Merging for Vision Transformers
Yonghao Yu, Dongcheng Zhao, Guobin Shen, Yiting Dong, Yi Zeng

TL;DR
This paper introduces Stepwise Patch Merging (SPM), a brain-inspired hierarchical approach for Vision Transformers that improves global and local feature integration, leading to enhanced performance in vision tasks.
Contribution
The paper proposes SPM, a novel patch merging method that combines multi-scale aggregation with guided local enhancement, inspired by brain mechanisms, to improve hierarchical ViT architectures.
Findings
SPM improves accuracy on ImageNet-1K, COCO, and ADE20K datasets.
SPM enhances dense prediction tasks like object detection and segmentation.
Combining SPM with different backbones yields further performance gains.
Abstract
The hierarchical architecture has become a mainstream design paradigm for Vision Transformers (ViTs), with Patch Merging serving as the pivotal component that transforms a columnar architecture into a hierarchical one. Drawing inspiration from the brain's ability to integrate global and local information for comprehensive visual understanding, we propose Stepwise Patch Merging (SPM), which enhances the subsequent attention mechanism's ability to 'see' better. SPM consists of Multi-Scale Aggregation (MSA) and Guided Local Enhancement (GLE), striking a proper balance between long-range dependency modeling and local feature enhancement. Extensive experiments conducted on benchmark datasets, including ImageNet-1K, COCO, and ADE20K, demonstrate that SPM significantly improves the performance of various models, particularly in dense prediction tasks such as object detection and semantic…
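For readers unfamiliar with the baseline operation, the sketch below illustrates generic patch merging of the kind used in hierarchical ViTs, not SPM itself: each 2x2 block of neighbouring tokens is concatenated into one token, so the token count drops by 4x while the channel width grows by 4x. The function name `patch_merge`, the `window` parameter, and the nested-list layout are assumptions made for illustration; the learned linear projection that usually follows the concatenation is omitted, and the abstract does not give enough detail to reproduce MSA or GLE.

```python
def patch_merge(grid, window=2):
    """Concatenate each window x window block of tokens into a single token.

    grid is a nested list indexed as grid[row][col][channel].
    Hypothetical helper: plain patch merging only, no learned projection.
    """
    height, width = len(grid), len(grid[0])
    if height % window or width % window:
        raise ValueError("grid size must be divisible by the window size")

    merged = []
    for top in range(0, height, window):
        merged_row = []
        for left in range(0, width, window):
            token = []
            # Walk the block row by row and append every token's channels.
            for dr in range(window):
                for dc in range(window):
                    token.extend(grid[top + dr][left + dc])
            merged_row.append(token)
        merged.append(merged_row)
    return merged


# A 4x4 grid of tokens with 3 channels each (placeholder values).
grid = [[[float(r * 4 + c)] * 3 for c in range(4)] for r in range(4)]
merged = patch_merge(grid)

print(len(grid) * len(grid[0]), "tokens of width", len(grid[0][0]))
print(len(merged) * len(merged[0]), "tokens of width", len(merged[0][0]))
```

Stacking such a step between groups of Transformer blocks is what gives a hierarchical ViT its progressively coarser feature maps; according to the abstract, SPM targets this merging step.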
Taxonomy
Topics: Advanced Memory and Neural Computing · CCD and CMOS Imaging Sensors · EEG and Brain-Computer Interfaces
Methods: Softmax · Attention Is All You Need
