Does Self-Attention Need Separate Weights in Transformers?
Md Kowsher, Nusrat Jahan Prottasha, Chun-Nam Yu, Ozlem Ozmen Garibay, Niloofar Yousefi

TL;DR
This paper proposes a shared-weight self-attention mechanism for BERT that cuts attention-block parameters by about two-thirds and shortens training time, while improving accuracy on small GLUE tasks and robustness to noisy and out-of-domain data.
Contribution
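It introduces a shared-weight self-attention approach for BERT that learns one weight matrix for the Key, Value, and Query representations instead of three separate ones, reducing parameter count and training time while improving accuracy over the BERT baseline on small GLUE tasks.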
Findings
Parameter size reduced by 66.53% in attention blocks.
Accuracy improvements of up to 5.81% on GLUE tasks.
Enhanced generalization on noisy and out-of-domain data.
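As a rough sanity check on the 66.53% figure, the sketch below counts only the projection weights: three d × d matrices for the standard layer versus one for the shared layer. The hidden size of 768 (BERT-base) and the decision to ignore biases, the output projection, and any auxiliary parameters are assumptions made for illustration. This is not the paper's own accounting.

```python
def projection_params(d_model: int, n_matrices: int) -> int:
    """Count weights in n_matrices square (d_model x d_model) projections."""
    return n_matrices * d_model * d_model


d_model = 768  # assumption: BERT-base hidden size, not stated in this summary

separate = projection_params(d_model, 3)  # separate Query, Key, Value matrices
shared = projection_params(d_model, 1)    # one matrix reused for all three

reduction = 1 - shared / separate
print(f"separate: {separate}, shared: {shared}, reduction: {reduction:.2%}")
# prints: separate: 1769472, shared: 589824, reduction: 66.67%
```

This idealized count gives a two-thirds reduction, slightly above the reported 66.53%. The small gap would be consistent with the shared layer carrying a few extra parameters, though the summary does not say what they are.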
Abstract
The success of self-attention lies in its ability to capture long-range dependencies and enhance context understanding, but it is limited by its computational complexity and challenges in handling sequential data with inherent directionality. This work introduces a shared weight self-attention-based BERT model that learns only one weight matrix for the Key, Value, and Query representations instead of three individual matrices. Our shared weight attention reduces the training parameter size by more than half and training time by around one-tenth. Furthermore, we demonstrate higher prediction accuracy than the BERT baseline on small GLUE tasks and, in particular, stronger generalization on noisy and out-of-domain data. Experimental results indicate that our shared self-attention method achieves a parameter size reduction of 66.53% in the attention block. In the GLUE…
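The idea of one weight matrix serving the Key, Value, and Query roles can be illustrated with a minimal single-head sketch in NumPy. It takes the simplest possible reading of the abstract, in which the same projection of the input is used in all three roles. The function names, shapes, and the lack of multi-head splitting, masking, and biases are assumptions for illustration, and the authors' actual formulation may include components the abstract does not describe.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def standard_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention with three weight matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = (q @ k.T) / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def shared_weight_attention(x, w_shared):
    """Hypothetical shared-weight variant: one projection plays all three roles."""
    s = x @ w_shared                      # projected once, reused as Q, K and V
    scores = (s @ s.T) / np.sqrt(s.shape[-1])
    return softmax(scores) @ s


rng = np.random.default_rng(0)
seq_len, d_model = 4, 8                   # toy sizes chosen for illustration
x = rng.normal(size=(seq_len, d_model))
w = rng.normal(size=(d_model, d_model))

out_shared = shared_weight_attention(x, w)
out_tied = standard_attention(x, w, w, w)  # standard layer with tied weights
print(out_shared.shape, np.allclose(out_shared, out_tied))
```

Under this reading the shared layer is simply the standard layer with its three matrices tied, which is why the two outputs agree in the example. One consequence is that the score matrix becomes symmetric before the softmax.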
Taxonomy
Topics: EEG and Brain-Computer Interfaces · Neural dynamics and brain function · Machine Learning in Materials Science
Methods: Softmax · Dropout · Dense Connections · Layer Normalization · Linear Layer · Multi-Head Attention · Weight Decay · Linear Warmup With Linear Decay · WordPiece
