Symmetric Dot-Product Attention for Efficient Training of BERT Language Models

Martin Courtois; Malte Ostendorff; Leonhard Hennig; Georg Rehm

arXiv:2406.06366 · cs.CL · June 21, 2024

TL;DR

This paper introduces a symmetric dot-product attention mechanism that improves BERT training efficiency by reducing parameters and training steps while slightly increasing performance on the GLUE benchmark.

Contribution

The paper proposes a symmetric attention function that improves both the training efficiency and the performance of BERT-like models relative to traditional scaled dot-product attention.

Findings

01

Achieves a score of 79.36 on the GLUE benchmark, surpassing traditional scaled dot-product attention.

02

Reduces the number of trainable parameters by 6%.

03

Halves the number of training steps needed for convergence.
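
As a rough sanity check on the 6% figure, the arithmetic below counts the parameters saved if one of the two query/key projections is dropped in every layer. This is only a sketch under stated assumptions: the BERT-base-like configuration (hidden size 768, 12 layers, roughly 110M parameters in total) and the idea that exactly one projection per layer is removed are assumptions made here for illustration, not figures taken from this page.

```python
# Assumed BERT-base-like configuration (not stated on this page).
hidden_size = 768
num_layers = 12
total_params = 110_000_000  # rounded, assumed total parameter count

# One dense projection (weights + bias) of shape hidden_size x hidden_size.
params_per_projection = hidden_size * hidden_size + hidden_size

# Assumption: sharing queries and keys removes one projection per layer.
params_saved = num_layers * params_per_projection
fraction_saved = params_saved / total_params

print(f"parameters saved: {params_saved:,}")
print(f"fraction of total: {fraction_saved:.1%}")
```

Under these assumptions the saving lands in the same range as the reported 6%, which suggests the reduction plausibly comes from the attention projections rather than from other parts of the model.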

Abstract

Initially introduced as a machine translation model, the Transformer architecture has become the foundation of modern deep learning architectures, with applications in a wide range of fields, from computer vision to natural language processing. To tackle increasingly complex tasks, Transformer-based models are now stretched to enormous sizes, requiring ever larger training datasets and an unsustainable amount of compute resources. Given its ubiquity, the Transformer and its core component, the attention mechanism, are thus prime targets for efficiency research. In this work, we propose an alternative compatibility function for the self-attention mechanism introduced by the Transformer architecture. This compatibility function exploits an overlap in the learned representation of the traditional scaled dot-product attention, leading to a symmetric with pairwise…
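
The abstract is truncated here, so the exact formulation is not shown on this page. The following is a minimal NumPy sketch of what a symmetric compatibility function could look like, assuming that queries and keys share a single projection matrix so that the score matrix equals its own transpose. The function names, shapes, and random inputs are hypothetical, and the pairwise variant mentioned in the abstract is not reproduced.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def scaled_dot_product_scores(X, W_q, W_k):
    """Traditional scores: separate query and key projections."""
    Q = X @ W_q
    K = X @ W_k
    return Q @ K.T / np.sqrt(W_q.shape[1])


def symmetric_scores(X, W_qk):
    """Assumed symmetric variant: one shared projection for queries and keys."""
    P = X @ W_qk
    return P @ P.T / np.sqrt(W_qk.shape[1])


def attention(scores, V):
    """Row-wise softmax over the scores, then a weighted sum of the values."""
    return softmax(scores, axis=-1) @ V


# Hypothetical toy sizes, chosen only for illustration.
rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 5, 16, 8
X = rng.normal(size=(n_tokens, d_model))
W_q = rng.normal(size=(d_model, d_head))
W_k = rng.normal(size=(d_model, d_head))
W_v = rng.normal(size=(d_model, d_head))

S_std = scaled_dot_product_scores(X, W_q, W_k)
S_sym = symmetric_scores(X, W_q)  # reuse W_q as the shared projection
out = attention(S_sym, X @ W_v)

print("traditional scores symmetric:", np.allclose(S_std, S_std.T))
print("shared-projection scores symmetric:", np.allclose(S_sym, S_sym.T))
```

Note that the attention weights after the row-wise softmax are generally no longer symmetric, since each row is normalized independently; only the pre-softmax scores are.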

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques

Methods: Attention Is All You Need · Softmax · Layer Normalization · Linear Layer · Byte Pair Encoding · Label Smoothing · Adam · Residual Connection · Multi-Head Attention · Position-Wise Feed-Forward Layer