Horizontal and Vertical Attention in Transformers
Litao Yu, Jian Zhang

TL;DR
This paper introduces horizontal and vertical attention mechanisms to enhance feature representation in Transformers, improving their performance and generalization with minimal additional computational cost.
Contribution
It proposes two attention modules that can be integrated into Transformers: horizontal attention, which re-weights the multi-head outputs, and vertical attention, which performs channel-wise calibration of feature responses.
Findings
Transformer models equipped with both attentions generalize well across different supervised learning tasks.
The two attentions add only a very minor computational overhead.
The modules are self-contained and can be added to existing Transformer architectures.
Abstract
Transformers are built upon multi-head scaled dot-product attention and positional encoding, which aim to learn the feature representations and token dependencies. In this work, we focus on enhancing the distinctive representation by learning to augment the feature maps with the self-attention mechanism in Transformers. Specifically, we propose the horizontal attention to re-weight the multi-head output of the scaled dot-product attention before dimensionality reduction, and propose the vertical attention to adaptively re-calibrate channel-wise feature responses by explicitly modelling inter-dependencies among different channels. We demonstrate the Transformer models equipped with the two attentions have a high generalization capability across different supervised learning tasks, with a very minor additional computational cost overhead. The proposed horizontal and vertical attentions…
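A minimal NumPy sketch of the two ideas described in the abstract may help make them concrete. It is not the authors' implementation: the summary does not give the exact formulation, so the scoring vector `w_score`, the softmax over heads, and the squeeze-and-excitation-style gate (mean over tokens, bottleneck with reduction ratio `r`, sigmoid) are all assumptions chosen for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def horizontal_attention(heads, w_score):
    """Re-weight per-head outputs before they are concatenated and projected.

    heads:   (H, N, d_k) outputs of scaled dot-product attention, one per head.
    w_score: (d_k,) hypothetical learned scoring vector (an assumption).
    Returns the concatenated features (N, H * d_k) and the head weights (H, N).
    """
    scores = heads @ w_score                  # (H, N): one score per head and token
    weights = softmax(scores, axis=0)         # normalize across heads for each token
    reweighted = heads * weights[..., None]   # scale each head's output
    n_tokens = heads.shape[1]
    concat = reweighted.transpose(1, 0, 2).reshape(n_tokens, -1)
    return concat, weights

def vertical_attention(x, w1, w2):
    """Channel-wise re-calibration, assumed here to be squeeze-and-excitation style.

    x:  (N, C) token features.
    w1: (C, C // r) and w2: (C // r, C) hypothetical bottleneck weights.
    Returns the gated features (N, C) and the per-channel gates (C,).
    """
    squeeze = x.mean(axis=0)                  # (C,): summary of each channel
    hidden = np.maximum(squeeze @ w1, 0.0)    # ReLU bottleneck
    gates = sigmoid(hidden @ w2)              # (C,): one gate per channel
    return x * gates, gates

# Illustrative shapes only; random weights stand in for learned parameters.
rng = np.random.default_rng(0)
H, N, d_k, r = 4, 6, 8, 4
C = H * d_k
heads = rng.standard_normal((H, N, d_k))
concat, head_weights = horizontal_attention(heads, rng.standard_normal(d_k))
out, gates = vertical_attention(
    concat,
    rng.standard_normal((C, C // r)),
    rng.standard_normal((C // r, C)),
)
```

In this sketch the horizontal step adds one dot product per head and token, and the vertical step adds two small matrix products per sequence, which is consistent with the abstract's claim of a very minor overhead, though the actual cost depends on the paper's formulation.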
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications · Human Pose and Action Recognition
Methods: Attention Is All You Need · Linear Layer · Dense Connections · Absolute Position Encodings · Dropout · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Residual Connection · Multi-Head Attention · Layer Normalization
