DeepCrossAttention: Supercharging Transformer Residual Connections
Mike Heddes, Adel Javanmard, Kyriakos Axiotis, Gang Fu, MohammadHossein Bateni, Vahab Mirrokni

TL;DR
DeepCrossAttention (DCA) enhances transformer residual connections by using learnable, input-dependent weights and depth-wise cross-attention, leading to faster training and improved language modeling performance.
Contribution
DCA introduces a residual connection method that combines the outputs of previous layers using dynamic, input-dependent weights and depth-wise cross-attention, improving the training efficiency and language modeling quality of transformer models.
Findings
DCA achieves improved perplexity for a given training time in language modeling experiments.
DCA reaches the same model quality up to 3x faster.
Theoretical analysis shows better accuracy-size trade-offs with DCA.
Abstract
Transformer networks have achieved remarkable success across diverse domains, leveraging a variety of architectural innovations, including residual connections. However, traditional residual connections, which simply sum the outputs of previous layers, can dilute crucial information. This work introduces DeepCrossAttention (DCA), an approach that enhances residual learning in transformers. DCA employs learnable, input-dependent weights to dynamically combine layer outputs, enabling the model to selectively focus on the most relevant information in any of the previous layers. Furthermore, DCA incorporates depth-wise cross-attention, allowing for richer interactions between layers at different depths. Our language modeling experiments show that DCA achieves improved perplexity for a given training time. Moreover, DCA obtains the same model quality up to 3x faster while adding a negligible…
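To make the contrast in the abstract concrete, the sketch below compares a plain residual sum with a weighted combination whose weights depend on the layer outputs themselves. It is a minimal illustration, not the paper's formulation: the function names, the scoring vector `v`, the per-layer bias `b`, and the rule `w_i = dot(x_i, v) + b[i]` are all assumptions introduced here, and the depth-wise cross-attention component is not covered.

```python
def residual_sum(layer_outputs):
    """Standard residual stream: element-wise sum of all previous layer outputs."""
    return [sum(vals) for vals in zip(*layer_outputs)]


def weighted_combination(layer_outputs, v, b):
    """Combine layer outputs using input-dependent scalar weights.

    Hypothetical parameterization (not taken from the paper):
        w_i = dot(x_i, v) + b[i]
    where v and b stand in for learnable parameters.
    """
    # One scalar weight per layer output, computed from that output.
    weights = [
        sum(xi * vi for xi, vi in zip(x, v)) + b_i
        for x, b_i in zip(layer_outputs, b)
    ]
    dim = len(layer_outputs[0])
    # Weighted element-wise sum over the stack of layer outputs.
    combined = [
        sum(w * x[k] for w, x in zip(weights, layer_outputs))
        for k in range(dim)
    ]
    return combined, weights


# Three toy "layer outputs" of dimension 2.
layer_outputs = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]

# With v = 0 and b = 1, every weight is 1 and the plain residual sum is recovered.
baseline, _ = weighted_combination(layer_outputs, v=[0.0, 0.0], b=[1.0, 1.0, 1.0])
assert baseline == residual_sum(layer_outputs)

# With a non-zero v, layers whose outputs align with v receive larger weights.
combined, weights = weighted_combination(layer_outputs, v=[1.0, 0.0], b=[0.0, 0.0, 0.0])
```

In this toy setup the plain sum is a special case of the weighted form, so a model with such weights could fall back to standard residual behavior while also being able to emphasize or suppress individual layers per input.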
Taxonomy
Topics: High voltage insulation and dielectric phenomena · Power Transformer Diagnostics and Insulation · Electrical Fault Detection and Protection
