Axially Expanded Windows for Local-Global Interaction in Vision Transformers
Zhemin Zhang, Xun Gong

TL;DR
This paper introduces axially expanded window self-attention in Vision Transformers, combining local and coarse global attention to efficiently model both short- and long-range dependencies in high-resolution images.
Contribution
It proposes a novel attention mechanism that balances local detail and global context, improving efficiency and modeling capacity in Vision Transformers.
Findings
Enhanced ability to capture multi-scale dependencies
Improved performance on vision tasks
Efficient computation for high-resolution images
Abstract
Recently, Transformers have shown promising performance in various vision tasks. A challenging issue in Transformer design is that global self-attention is very expensive to compute, especially for high-resolution vision tasks. Local self-attention restricts the attention computation to a local region to improve efficiency, but the receptive field of a single attention layer is then not large enough, resulting in insufficient context modeling. When observing a scene, humans usually focus on a local region while attending to non-attentional regions at coarse granularity. Based on this observation, we develop the axially expanded window self-attention mechanism that performs fine-grained self-attention within the local window and coarse-grained self-attention in the horizontal and vertical axes, and thus can effectively capture both short- and long-range visual…
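The mechanism described in the abstract can be illustrated with a rough NumPy sketch. This is not the authors' implementation. It assumes a single head with no learned projections, non-overlapping square windows for the fine-grained part, and average pooling along each query's row and column to form the coarse-grained tokens. The function names and the `window` and `pool` parameters are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D score vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def keys_per_query(H, W, window, pool):
    # Fine tokens in the local window plus coarse tokens on the row and column.
    return window * window + W // pool + H // pool

def axially_expanded_attention(x, window=2, pool=2):
    """x: float feature map of shape (H, W, C); returns the same shape.

    Illustrative assumptions: queries, keys, and values are all the raw
    features, and coarse tokens are average-pooled row/column segments.
    """
    H, W, C = x.shape
    assert H % window == 0 and W % window == 0
    assert H % pool == 0 and W % pool == 0
    # Coarse tokens: pool each row along its width, each column along its height.
    row_coarse = x.reshape(H, W // pool, pool, C).mean(axis=2)  # (H, W/pool, C)
    col_coarse = x.reshape(H // pool, pool, W, C).mean(axis=1)  # (H/pool, W, C)
    out = np.empty_like(x)
    scale = 1.0 / np.sqrt(C)
    for i in range(H):
        for j in range(W):
            # Fine-grained keys: every token in the query's local window.
            wi, wj = (i // window) * window, (j // window) * window
            local = x[wi:wi + window, wj:wj + window].reshape(-1, C)
            # Coarse-grained keys: pooled tokens on the query's row and column.
            keys = np.concatenate([local, row_coarse[i], col_coarse[:, j]], axis=0)
            attn = softmax(keys @ x[i, j] * scale)
            out[i, j] = attn @ keys
    return out

x = np.random.default_rng(0).normal(size=(8, 8, 4))
y = axially_expanded_attention(x, window=2, pool=2)
```

Under these assumptions, each query on the 8 x 8 map attends to `keys_per_query(8, 8, 2, 2)` tokens instead of all 64, which is the kind of saving over global self-attention that the abstract points to for high-resolution inputs.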
Taxonomy
Topics: Advanced Memory and Neural Computing · Tactile and Sensory Interactions · CCD and CMOS Imaging Sensors
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Softmax · Dropout · Residual Connection · Dense Connections
