Symmetric Dot-Product Attention for Efficient Training of BERT Language Models
Martin Courtois, Malte Ostendorff, Leonhard Hennig, Georg Rehm

TL;DR
This paper introduces a symmetric dot-product attention mechanism that improves BERT training efficiency: it reduces trainable parameters by 6% and halves the training steps needed for convergence, while slightly improving performance on the GLUE benchmark.
Contribution
It proposes a novel symmetric attention function that enhances efficiency and performance of BERT-like models compared to traditional scaled dot-product attention.
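To make the contrast with scaled dot-product attention concrete, here is a minimal NumPy sketch. It assumes the symmetric variant reuses a single projection for both queries and keys, which this summary does not spell out; the function names, the shared matrix `w_qk`, and the toy dimensions are illustrative and not taken from the paper.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def scaled_dot_product_scores(x, w_q, w_k):
    """Traditional pre-softmax scores: (X W_q)(X W_k)^T / sqrt(d_k)."""
    q = x @ w_q
    k = x @ w_k
    return q @ k.T / np.sqrt(q.shape[-1])


def symmetric_scores(x, w_qk):
    """Assumed symmetric variant: queries and keys share one projection."""
    q = x @ w_qk
    return q @ q.T / np.sqrt(q.shape[-1])


# Toy input: 4 tokens with 8-dimensional embeddings (hypothetical sizes).
rng = np.random.default_rng(0)
seq_len, d_model, d_k = 4, 8, 8
x = rng.normal(size=(seq_len, d_model))
w_q = rng.normal(size=(d_model, d_k))
w_k = rng.normal(size=(d_model, d_k))

standard = scaled_dot_product_scores(x, w_q, w_k)
symmetric = symmetric_scores(x, w_q)  # no separate key matrix needed

print("standard scores symmetric: ", np.allclose(standard, standard.T))
print("shared-projection symmetric:", np.allclose(symmetric, symmetric.T))

# Row-wise softmax turns scores into attention weights; each row sums to 1.
attn = softmax(symmetric)
```

Under this assumption the score matrix equals its own transpose before the softmax, and the key projection matrix disappears from each attention layer. The row-wise softmax normalizes each row separately, so the final attention weights are generally not symmetric.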
Findings
Achieves a score of 79.36 on the GLUE benchmark, surpassing traditional scaled dot-product attention.
Reduces trainable parameters by 6%.
Halves the number of training steps needed for convergence.
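As a rough plausibility check on the 6% figure, the arithmetic below counts the parameters of one hidden-by-hidden projection per layer. It assumes a BERT-base-sized model (hidden size 768, 12 layers, roughly 110 million parameters) and assumes the saving comes from dropping the key projection; neither assumption is stated in this summary, so treat the result as a back-of-the-envelope estimate rather than the paper's own accounting.

```python
# Assumed BERT-base dimensions; the summary does not state the model size.
hidden_size = 768
num_layers = 12
total_params = 110_000_000  # approximate size of BERT-base

# One projection per layer: a weight matrix plus a bias vector.
key_proj_params_per_layer = hidden_size * hidden_size + hidden_size

# Hypothetical saving if every layer drops its separate key projection.
saved = num_layers * key_proj_params_per_layer
fraction = saved / total_params

print(f"{saved:,} parameters saved, about {fraction:.1%} of the model")
```

The estimate lands near 6%, which is consistent with the reported reduction under these assumptions.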
Abstract
Initially introduced as a machine translation model, the Transformer architecture has become the foundation of modern deep learning architectures, with applications in a wide range of fields, from computer vision to natural language processing. To tackle increasingly complex tasks, Transformer-based models are now stretched to enormous sizes, requiring ever larger training datasets and an unsustainable amount of compute resources. Given its ubiquity, the Transformer and its core component, the attention mechanism, are thus prime targets for efficiency research. In this work, we propose an alternative compatibility function for the self-attention mechanism introduced by the Transformer architecture. This compatibility function exploits an overlap in the learned representation of the traditional scaled dot-product attention, leading to a symmetric with pairwise…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques
Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Multi-Head Attention · Position-Wise Feed-Forward Layer
