LLM Inference Acceleration via Efficient Operation Fusion
Mahsa Salmani, Ilya Soloveychik

TL;DR
This paper introduces an efficient operation fusion technique that defers normalization steps in Transformer-based LLMs, enabling concurrent computation with linear layers to accelerate inference without accuracy loss.
Contribution
The proposed method reduces inference latency (by approximately 20% in the reported findings) by computing normalization in parallel with the linear layers, which hides the normalization overhead and improves hardware utilization in large language models.
Findings
Inference latency reduced by approximately 20%
Normalization overhead effectively hidden behind matrix multiplication
Numerical accuracy preserved during operation fusion
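To make the idea of deferring normalization concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a LayerNorm without learned scale and shift (these could be folded into the weights and bias), followed by a bias-free linear layer. The function names and the `row_sums` helper are hypothetical. It relies on the identity W((x - mu) / sigma) = (Wx - mu * W1) / sigma, so the matrix product on the raw input does not depend on the statistics and could, in principle, run concurrently with them.

```python
import math

def matvec(W, x):
    # Plain matrix-vector product; W is a list of rows.
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def stats(x, eps=1e-5):
    # Mean and inverse standard deviation: the "collective" reduction step.
    n = len(x)
    mu = sum(x) / n
    var = sum((xi - mu) ** 2 for xi in x) / n
    return mu, 1.0 / math.sqrt(var + eps)

def layernorm_then_linear(x, W):
    # Baseline order: normalize first, then apply the linear layer.
    mu, inv_std = stats(x)
    x_hat = [(xi - mu) * inv_std for xi in x]
    return matvec(W, x_hat)

def linear_then_deferred_norm(x, W):
    # Deferred order: the matmul uses the raw input and does not wait
    # for the statistics, so the two steps are independent of each other.
    z = matvec(W, x)
    mu, inv_std = stats(x)
    # Row sums of W depend only on the weights and could be precomputed.
    row_sums = [sum(row) for row in W]
    # Apply the mean correction and the scaling after the matmul.
    return [(zi - mu * r) * inv_std for zi, r in zip(z, row_sums)]

x = [1.0, 2.0, 4.0, 7.0]
W = [[0.5, -1.0, 2.0, 0.0], [1.5, 0.25, -0.5, 1.0]]
baseline = layernorm_then_linear(x, W)
deferred = linear_then_deferred_norm(x, W)
print(max(abs(a - b) for a, b in zip(baseline, deferred)))
```

The printed value is the largest absolute difference between the two orderings, which should be at the level of floating-point rounding. This sequential sketch only shows that the reordering is algebraically equivalent; any latency benefit depends on hardware that can actually overlap the two steps.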
Abstract
The rapid development of Transformer-based Large Language Models (LLMs) in recent years has been closely linked to their ever-growing and already enormous sizes. Many LLMs contain hundreds of billions of parameters and require dedicated hardware resources for training and inference. One of the key challenges inherent to the Transformer architecture is the requirement to support numerous non-linear transformations that involve normalization. For instance, each decoder block typically contains at least one Softmax operation and two Layernorms. The computation of the corresponding normalization scaling factors becomes a major bottleneck as it requires spatial collective operations. In other words, when it comes to the computation of denominators for Softmax and Layernorm, all vector elements must be aggregated into a single location, requiring significant communication. These…
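The Softmax denominator mentioned above can be handled in a similar spirit. The sketch below is an illustration under stated assumptions, not the paper's algorithm: it considers a single attention query with a score vector and a value matrix `V`, and the function names are hypothetical. Because the denominator is one scalar shared by every output element, the division can be moved after the product with `V`. The maximum is still subtracted before exponentiation for numerical stability; how the paper treats that step is not stated in this excerpt.

```python
import math

def softmax_then_matmul(scores, V):
    # Baseline order: normalize the weights, then mix the value rows.
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    denom = sum(e)
    p = [ei / denom for ei in e]
    dim = len(V[0])
    return [sum(p[j] * V[j][k] for j in range(len(V))) for k in range(dim)]

def matmul_then_deferred_softmax(scores, V):
    # Deferred order: mix the value rows with unnormalized weights.
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    dim = len(V[0])
    unnorm = [sum(e[j] * V[j][k] for j in range(len(V))) for k in range(dim)]
    # The denominator does not depend on `unnorm`, so the reduction
    # could run alongside the matrix product above.
    denom = sum(e)
    # A single division per output element, applied at the very end.
    return [u / denom for u in unnorm]

scores = [0.2, -1.3, 2.5]
V = [[1.0, 0.0], [0.5, 2.0], [-1.0, 3.0]]
baseline = softmax_then_matmul(scores, V)
deferred = matmul_then_deferred_softmax(scores, V)
print(max(abs(a - b) for a, b in zip(baseline, deferred)))
```

As with any reordering of floating-point operations, the two results agree up to rounding rather than bit for bit.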
Taxonomy
Topics: Fault Detection and Control Systems · Neural Networks and Applications
Methods: Attention Is All You Need · Absolute Position Encodings · Dense Connections · Layer Normalization · Byte Pair Encoding · Residual Connection · Label Smoothing · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam
