Local-to-Global Self-Attention in Vision Transformers
Jinpeng Li, Yichao Yan, Shengcai Liao, Xiaokang Yang, Ling Shao

TL;DR
This paper introduces a multi-path Transformer architecture that enables local-to-global reasoning at multiple granularities in each stage, improving image classification and semantic segmentation results at a marginal increase in computational cost.
Contribution
It proposes a multi-path structure for Vision Transformers that performs local-to-global reasoning at multiple granularities within each stage, restoring the global feature reasoning that window-based hierarchical designs lack in early stages while adding only marginal computational overhead.
Findings
Achieves notable improvements in image classification accuracy.
Enhances semantic segmentation performance.
Maintains computational efficiency with marginal overhead.
Abstract
Transformers have demonstrated great potential in computer vision tasks. To avoid dense self-attention computation on high-resolution visual data, some recent Transformer models adopt a hierarchical design in which self-attention is computed only within local windows. This design significantly improves efficiency but lacks global feature reasoning in the early stages. In this work, we design a multi-path structure for the Transformer that enables local-to-global reasoning at multiple granularities in each stage. The proposed framework is computationally efficient and highly effective. With a marginal increase in computational overhead, our model achieves notable improvements in both image classification and semantic segmentation. Code is available at https://github.com/ljpadam/LG-Transformer
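The efficiency argument in the abstract can be illustrated with a rough count of query-key pairs. The sketch below is not taken from the paper or its repository: the function names, the 56x56 token grid, the 7x7 window, and the downsampling factors 1, 2, and 4 are all assumptions chosen for illustration. It counts attention pairs only and ignores projections, feed-forward layers, and the cost of downsampling and merging paths.

```python
def attention_pairs(height, width, window=None):
    """Count query-key pairs for self-attention on a height x width token grid.

    window=None means global attention (every token attends to every token).
    Otherwise attention is restricted to non-overlapping window x window blocks.
    """
    tokens = height * width
    if window is None:
        return tokens * tokens
    if height % window or width % window:
        raise ValueError("grid must be divisible by the window size")
    # Each token attends to the window * window tokens in its own block.
    return tokens * window * window


def multi_path_pairs(height, width, window, downsample_factors):
    """Sum windowed-attention pairs over parallel paths (hypothetical layout).

    Each path downsamples the grid by its factor before windowed attention,
    so a fixed-size window covers a larger region of the original image.
    """
    total = 0
    for factor in downsample_factors:
        total += attention_pairs(height // factor, width // factor, window)
    return total


# Assumed sizes, for illustration only.
global_cost = attention_pairs(56, 56)
local_cost = attention_pairs(56, 56, window=7)
multi_cost = multi_path_pairs(56, 56, 7, [1, 2, 4])

print("global:", global_cost)
print("local windows:", local_cost)
print("multi-path windows:", multi_cost)
```

Under these assumed sizes, the two coarser paths together add 1/4 + 1/16 of the local path's pair count, while global attention on the full grid needs 64 times as many pairs as the local path. This is only meant to show why windowed attention on downsampled paths can widen the receptive field cheaply; the paper's actual overhead depends on details not given here.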
Taxonomy
Topics: Infrared Target Detection Methodologies · Visual Attention and Saliency Detection · CCD and CMOS Imaging Sensors
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Label Smoothing · Residual Connection · Dense Connections · Softmax
