Fcaformer: Forward Cross Attention in Hybrid Vision Transformer

Haokui Zhang; Wenze Hu; Xiaoyu Wang

arXiv:2211.07198 · cs.CV · March 21, 2023 · 1 citation

TL;DR

FcaFormer introduces forward cross attention with learnable scale factors and token merge modules to densify attention in vision transformers, improving performance while reducing computational costs.

Contribution

The paper proposes a novel forward cross attention mechanism and associated modules to enhance token interactions across blocks in vision transformers, achieving better efficiency and accuracy.

Findings

01. Achieves 83.1% top-1 accuracy on ImageNet with 16.3M parameters.

02. Reduces parameters and computational costs compared to previous models.

03. Improves information flow across transformer blocks.

Abstract

Currently, one main research line in designing more efficient vision transformers is reducing the computational cost of self-attention modules by adopting sparse attention or using local attention windows. In contrast, we propose a different approach that aims to improve the performance of transformer-based architectures by densifying the attention pattern. Specifically, we propose forward cross attention for hybrid vision transformers (FcaFormer), in which tokens from previous blocks in the same stage are reused. To achieve this, FcaFormer leverages two components: learnable scale factors (LSFs) and a token merge and enhancement module (TME). The LSFs enable efficient processing of cross tokens, while the TME generates representative cross tokens. By integrating these components, the proposed FcaFormer enhances the interactions of tokens across blocks with…
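The idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: it assumes a single attention head, plain linear projections, and an average-pooling stand-in for the TME (the abstract does not specify how tokens are merged). The names `merge_tokens`, `forward_cross_attention`, and all parameters are hypothetical. The sketch only shows how queries from the current block can attend over both the current tokens and scaled "cross tokens" taken from an earlier block.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def merge_tokens(tokens, group):
    # Hypothetical stand-in for the TME: average every `group` consecutive
    # tokens so that fewer, more compact cross tokens are passed forward.
    n, d = tokens.shape
    usable = n - n % group  # drop a trailing remainder, if any
    return tokens[:usable].reshape(-1, group, d).mean(axis=1)


def forward_cross_attention(x, cross_tokens, scales, wq, wk, wv):
    # x:            (n, d) tokens of the current block
    # cross_tokens: list of (m_i, d) arrays kept from earlier blocks
    # scales:       list of (d,) vectors, one per earlier block; these play
    #               the role of learnable scale factors (assumed per-channel)
    scaled = [c * s for c, s in zip(cross_tokens, scales)]
    kv_input = np.concatenate([x] + scaled, axis=0)  # (n + sum m_i, d)

    q = x @ wq          # queries come from the current tokens only
    k = kv_input @ wk   # keys and values also cover the cross tokens
    v = kv_input @ wv

    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return attn @ v, attn


# Usage with random data: 16 current tokens, 16 earlier tokens merged to 4.
rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=(16, d))
earlier = rng.normal(size=(16, d))
cross = merge_tokens(earlier, group=4)
scale = np.ones(d)  # a trained model would learn these values
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

out, attn = forward_cross_attention(x, [cross], [scale], wq, wk, wv)
print(out.shape, attn.shape)
```

In this sketch the attention matrix gains one extra column per cross token, so merging earlier tokens before reuse keeps the added cost small relative to attending over every earlier token.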

Taxonomy

Topics: Advanced Memory and Neural Computing · Advanced Neural Network Applications · Visual Attention and Saliency Detection

Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Residual Connection · Dense Connections · Knowledge Distillation · Layer Normalization · Vision Transformer · Linear Layer