Local-to-Global Self-Attention in Vision Transformers

Jinpeng Li; Yichao Yan; Shengcai Liao; Xiaokang Yang; Ling Shao

arXiv:2107.04735 · cs.CV · July 13, 2021 · 22 citations

TL;DR

This paper introduces a multi-path Transformer architecture that performs local-to-global reasoning at multiple granularities in each stage, improving image classification and semantic segmentation accuracy at a marginal increase in computational cost.

Contribution

It proposes a multi-path structure for Vision Transformers that enables local-to-global reasoning at multiple granularities in each stage, improving accuracy with only a marginal increase in computational overhead.

Findings

01 Achieves notable improvements in image classification accuracy.

02 Enhances semantic segmentation performance.

03 Maintains computational efficiency with marginal overhead.

Abstract

Transformers have demonstrated great potential in computer vision tasks. To avoid dense computations of self-attentions in high-resolution visual data, some recent Transformer models adopt a hierarchical design, where self-attentions are only computed within local windows. This design significantly improves the efficiency but lacks global feature reasoning in early stages. In this work, we design a multi-path structure of the Transformer, which enables local-to-global reasoning at multiple granularities in each stage. The proposed framework is computationally efficient and highly effective. With a marginal increase in computational overhead, our model achieves notable improvements in both image classification and semantic segmentation. Code is available at https://github.com/ljpadam/LG-Transformer
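The efficiency argument in the abstract can be made concrete with a rough count of query-key pairs. The sketch below is illustrative only and is not taken from the paper or its code: the function names, the 56 x 56 token grid, the 7 x 7 window, and the downsampling rates (1, 2, 4) are all assumptions chosen for the example. It compares dense global attention, attention restricted to non-overlapping local windows, and a hypothetical multi-path layout in which each extra path runs the same windowed attention on a coarser copy of the feature map, so that a fixed-size window spans a larger region of the image.

```python
def global_attention_pairs(height, width):
    """Query-key pairs when every token attends to every other token."""
    tokens = height * width
    return tokens * tokens


def window_attention_pairs(height, width, window):
    """Query-key pairs when attention is restricted to non-overlapping
    window x window blocks of the token grid."""
    if height % window or width % window:
        raise ValueError("token grid must be divisible by the window size")
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    return num_windows * tokens_per_window * tokens_per_window


def multi_path_pairs(height, width, window, rates=(1, 2, 4)):
    """Total pairs over parallel paths. Each path applies windowed attention
    to the grid downsampled by one entry of `rates` (assumed values, not
    confirmed by the source)."""
    return sum(
        window_attention_pairs(height // r, width // r, window) for r in rates
    )


if __name__ == "__main__":
    h = w = 56  # hypothetical early-stage token grid
    m = 7       # hypothetical window size
    print("global    :", global_attention_pairs(h, w))
    print("windowed  :", window_attention_pairs(h, w, m))
    print("multi-path:", multi_path_pairs(h, w, m))
```

Under these assumed numbers the two coarser paths add about 31% more attention pairs than the single local path, while the total stays at roughly 2% of the dense global count. The count covers attention scores only and ignores projections, feed-forward layers, and the cost of downsampling and merging paths, so it should not be read as the overhead reported by the authors.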

Taxonomy

Topics: Infrared Target Detection Methodologies · Visual Attention and Saliency Detection · CCD and CMOS Imaging Sensors

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Adam · Label Smoothing · Residual Connection · Dense Connections · Softmax