Fcaformer: Forward Cross Attention in Hybrid Vision Transformer
Haokui Zhang, Wenze Hu, Xiaoyu Wang

TL;DR
FcaFormer introduces forward cross attention, supported by learnable scale factors and a token merge and enhancement module, to densify the attention pattern in hybrid vision transformers, improving accuracy while reducing parameters and computational cost.
Contribution
The paper proposes a novel forward cross attention mechanism and associated modules to enhance token interactions across blocks in vision transformers, achieving better efficiency and accuracy.
Findings
Achieves 83.1% top-1 accuracy on ImageNet with 16.3M parameters.
Reduces parameters and computational costs compared to previous models.
Improves information flow across transformer blocks.
Abstract
Currently, one main research line in designing more efficient vision transformers is reducing the computational cost of self-attention modules by adopting sparse attention or using local attention windows. In contrast, we propose a different approach that aims to improve the performance of transformer-based architectures by densifying the attention pattern. Specifically, we propose forward cross attention for hybrid vision transformers (FcaFormer), where tokens from previous blocks in the same stage are reused. To achieve this, FcaFormer leverages two innovative components: learnable scale factors (LSFs) and a token merge and enhancement module (TME). The LSFs enable efficient processing of cross tokens, while the TME generates representative cross tokens. By integrating these components, the proposed FcaFormer enhances the interactions of tokens across blocks with…
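The idea in the abstract, queries from the current block attending over both the current tokens and reused tokens from earlier blocks, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the function names are hypothetical, the query/key/value projections are left as identity, each learnable scale factor is modeled as one fixed scalar per earlier block, and plain average pooling stands in for the token merge and enhancement module.

```python
import math


def softmax(row):
    # Numerically stable softmax over one list of scores.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def merge_tokens(tokens, group=2):
    # Hypothetical stand-in for TME: average each run of `group` tokens
    # into a single, more compact cross token.
    merged = []
    for i in range(0, len(tokens), group):
        chunk = tokens[i:i + group]
        merged.append([sum(col) / len(chunk) for col in zip(*chunk)])
    return merged


def forward_cross_attention(current, cross_by_block, scales):
    # Queries come from `current` only. Keys and values are the current
    # tokens plus the cross tokens of each earlier block, multiplied by
    # that block's scale factor (assumed form of an LSF).
    keys = list(current)
    for tokens, scale in zip(cross_by_block, scales):
        keys.extend([[scale * x for x in tok] for tok in tokens])
    dim = len(current[0])
    out = []
    for q in current:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Identity value projection: values are the same vectors as keys.
        out.append([sum(w * k[j] for w, k in zip(weights, keys)) for j in range(dim)])
    return out


# Usage: tokens from one earlier block are merged, scaled, and reused.
earlier_block = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
current = [[0.5, 0.5], [1.0, 0.0]]
cross = [merge_tokens(earlier_block)]
out = forward_cross_attention(current, cross, scales=[0.5])
```

With an empty list of cross tokens the function reduces to ordinary self-attention over the current tokens, which is one way to see the mechanism as densifying the attention pattern rather than replacing it.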
Taxonomy
Topics: Advanced Memory and Neural Computing · Advanced Neural Network Applications · Visual Attention and Saliency Detection
Methods: Multi-Head Attention · Attention Is All You Need · Softmax · Residual Connection · Dense Connections · Knowledge Distillation · Layer Normalization · Vision Transformer · Linear Layer
