LLM Inference Acceleration via Efficient Operation Fusion

Mahsa Salmani; Ilya Soloveychik

arXiv:2502.17728 · cs.CL · February 26, 2025

TL;DR

This paper introduces an efficient operation fusion technique that defers normalization steps in Transformer-based LLMs, enabling concurrent computation with linear layers to accelerate inference without accuracy loss.

Contribution

The proposed method hides normalization overhead by computing it in parallel with the adjacent linear layers, which reduces inference latency (by approximately 20%, per the findings below) and improves hardware utilization in large language models.

Findings

1. Inference latency reduced by approximately 20%

2. Normalization overhead effectively hidden behind matrix multiplication

3. Numerical accuracy preserved during operation fusion
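To make the second and third findings concrete, here is a minimal sketch of how a LayerNorm scaling factor can be deferred past a linear layer. It is an illustration of the general algebraic idea, not the authors' implementation: the function names are hypothetical, the learnable affine parameters of LayerNorm are omitted (they could be folded into the weights), and the two "branches" are written sequentially even though they have no data dependency on each other. The identity used is W((x - mu) / sigma) = (Wx - mu * W1) / sigma, where W1 is the vector of row sums of W.

```python
import math


def matvec(W, x):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def mean_and_inv_std(x, eps=1e-5):
    """The collective reductions: they need every element of x."""
    n = len(x)
    mu = sum(x) / n
    var = sum((xi - mu) ** 2 for xi in x) / n
    return mu, 1.0 / math.sqrt(var + eps)


def layernorm_then_linear(x, W, eps=1e-5):
    """Reference order: the matmul waits for the normalization."""
    mu, inv_sigma = mean_and_inv_std(x, eps)
    x_hat = [(xi - mu) * inv_sigma for xi in x]
    return matvec(W, x_hat)


def linear_then_deferred_norm(x, W, eps=1e-5):
    """Deferred order (hypothetical sketch): the matmul runs on the raw input."""
    # Branch A: the expensive matmul, independent of the statistics.
    z = matvec(W, x)
    # Branch B: the statistics, independent of the matmul.
    mu, inv_sigma = mean_and_inv_std(x, eps)
    # Row sums depend only on the weights, so they can be precomputed offline.
    row_sums = [sum(row) for row in W]
    # Cheap elementwise correction once both branches have finished.
    return [(zi - mu * ri) * inv_sigma for zi, ri in zip(z, row_sums)]
```

Both functions return the same values up to floating-point rounding; the difference is only in which operations must wait for the reductions.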

Abstract

The rapid development of Transformer-based Large Language Models (LLMs) in recent years has been closely linked to their ever-growing and already enormous sizes. Many LLMs contain hundreds of billions of parameters and require dedicated hardware resources for training and inference. One of the key challenges inherent to the Transformer architecture is the requirement to support numerous non-linear transformations that involve normalization. For instance, each decoder block typically contains at least one Softmax operation and two Layernorms. The computation of the corresponding normalization scaling factors becomes a major bottleneck because it requires spatial collective operations. In other words, to compute the denominators for Softmax and Layernorm, all vector elements must be aggregated in a single location, which requires significant communication. These…
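The Softmax denominator described above is a sum over all elements, yet it enters the result only as a single scalar. The sketch below, written under stated assumptions, shows why that scalar can be applied after the following matrix multiplication instead of before it. The function names are hypothetical, a single attention row is used for brevity, and the max-subtraction is a standard stability trick that is kept here even though it is itself a reduction. This is not claimed to be the paper's exact scheme.

```python
import math


def softmax_then_matmul(scores, V):
    """Reference order: normalize the weights, then multiply by V."""
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    denom = sum(e)
    p = [ei / denom for ei in e]
    cols = len(V[0])
    return [sum(p[i] * V[i][j] for i in range(len(p))) for j in range(cols)]


def matmul_then_deferred_softmax(scores, V):
    """Deferred order (hypothetical sketch): multiply first, divide last."""
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    cols = len(V[0])
    # Branch A: matmul with the unnormalized weights.
    acc = [sum(e[i] * V[i][j] for i in range(len(e))) for j in range(cols)]
    # Branch B: the collective sum, independent of branch A.
    denom = sum(e)
    # One division per output element once both branches are done.
    return [a / denom for a in acc]
```

Because the denominator is a scalar, dividing before or after the product with V gives the same result up to rounding.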

Taxonomy

Topics: Fault Detection and Control Systems · Neural Networks and Applications

Methods: Attention Is All You Need · Absolute Position Encodings · Dense Connections · Layer Normalization · Byte Pair Encoding · Residual Connection · Label Smoothing · Multi-Head Attention · Position-Wise Feed-Forward Layer · Adam