LASER: Attention with Exponential Transformation
Sai Surya Duvvuri, Inderjit S. Dhillon

TL;DR
This paper introduces LASER, a new attention mechanism that enhances gradient flow in transformers, leading to improved performance across language, vision, and speech tasks, with minimal implementation changes.
Contribution
LASER provides a theoretically justified modification to attention that increases gradient signals, improving learning efficiency and performance in large-scale models.
Findings
Up to 1.44% improvement on downstream tasks
Enhanced generalization across vision, speech, and text
Achieved better finetuning results with minimal changes
Abstract
Transformers have had a tremendous impact on several sequence-related tasks, largely due to their ability to retrieve from any part of the sequence via softmax-based dot-product attention. This mechanism plays a crucial role in the Transformer's performance. We analyze the gradients backpropagated through the softmax operation in the attention mechanism and observe that these gradients can often be small. This poor gradient signal backpropagation can lead to inefficient learning of parameters preceding the attention operations. To this end, we introduce a new attention mechanism called LASER, which we analytically show to admit a larger gradient signal. We show that LASER attention can be implemented by making small modifications to existing attention implementations. We conduct experiments on autoregressive large language models (LLMs) with up to 7.7 billion parameters with an average…
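The abstract says LASER needs only small changes to an existing attention implementation, but the excerpt above does not spell out the formula. The sketch below is a minimal single-query illustration under one assumption: that the "exponential transformation" in the title means attending over exp(V) and taking the log of the result, i.e. log(softmax(q K^T / sqrt(d)) exp(V)). All function names are hypothetical and chosen for illustration; this is not the authors' code.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(q, keys):
    # Scaled dot-product scores for a single query vector.
    scale = 1.0 / math.sqrt(len(q))
    return softmax([scale * sum(a * b for a, b in zip(q, k)) for k in keys])

def standard_attention(q, keys, values):
    # Usual attention output: weighted average of the value vectors.
    w = attention_weights(q, keys)
    dim = len(values[0])
    return [sum(w[j] * values[j][c] for j in range(len(values)))
            for c in range(dim)]

def laser_attention(q, keys, values):
    # ASSUMED formulation: log(softmax(q K^T / sqrt(d)) @ exp(V)).
    # The per-channel max of V is subtracted before exponentiating so
    # exp() cannot overflow, then added back after the log.
    w = attention_weights(q, keys)
    dim = len(values[0])
    out = []
    for c in range(dim):
        m = max(v[c] for v in values)
        s = sum(w[j] * math.exp(values[j][c] - m) for j in range(len(values)))
        out.append(m + math.log(s))
    return out

def softmax_self_gradient(weights):
    # Diagonal of the softmax Jacobian: d a_j / d s_j = a_j * (1 - a_j).
    # It shrinks toward zero as a weight approaches 0 or 1.
    return [a * (1.0 - a) for a in weights]
```

Under this assumption, the only difference from standard attention is the exp applied to the values on the way in and the log on the way out, which is consistent with the "small modifications" the abstract describes. The helper `softmax_self_gradient` illustrates the motivation: when one attention weight saturates near 1, the factor a(1 - a) that scales the gradient through softmax becomes very small.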
Taxonomy
Topics: 3D Surveying and Cultural Heritage · Image Processing and 3D Reconstruction
Methods: Attention Is All You Need · Position-Wise Feed-Forward Layer · Linear Layer · Byte Pair Encoding · Dropout · Absolute Position Encodings · Label Smoothing · Transformer · Dense Connections
