Efficient Large Language Model Inference with Neural Block Linearization

Mete Erdogan; Francesco Tonin; Volkan Cevher

arXiv:2505.21077 · cs.LG · October 21, 2025

TL;DR

This paper introduces Neural Block Linearization (NBL), a method that accelerates large language model inference by replacing selected self-attention layers with linear approximations. The reported result is a 32% inference speed-up with less than 1% accuracy trade-off.

Contribution

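The paper presents Neural Block Linearization, a framework that replaces self-attention layers in pre-trained transformers with linear approximations derived from Linear Minimum Mean Squared Error estimators, without fine-tuning. A theoretical upper bound on the approximation error, computed with Canonical Correlation Analysis, determines which layers are replaced.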

Findings

01. 32% inference speed-up on LLMs

02. Less than 1% accuracy trade-off

03. Applicable to pre-trained models without fine-tuning

Abstract

The high inference demands of transformer-based Large Language Models (LLMs) pose substantial challenges in their deployment. To this end, we introduce Neural Block Linearization (NBL), a novel framework for accelerating transformer model inference by replacing self-attention layers with linear approximations derived from Linear Minimum Mean Squared Error estimators. NBL leverages Canonical Correlation Analysis to compute a theoretical upper bound on the approximation error. Then, we use this bound as a criterion for substitution, selecting the LLM layers with the lowest linearization error. NBL can be efficiently applied to pre-trained LLMs without the need for fine-tuning. In experiments, NBL achieves notable computational speed-ups while preserving competitive accuracy on multiple reasoning benchmarks. For instance, applying NBL to 12 self-attention layers in…
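The pipeline described in the abstract (fit a linear estimator per layer, score each layer by a CCA-based quantity, replace the lowest-scoring layers) can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `fit_lmmse`, `cca_score`, and `select_layers`, the ridge term, and the synthetic "activations" are all hypothetical, and the score used here, the sum of `1 - rho_i^2` over canonical correlations (the LMMSE residual measured in whitened output coordinates), is a stand-in for the paper's bound, whose exact form is not given on this page.

```python
import numpy as np

def fit_lmmse(X, Y, ridge=1e-6):
    """Fit Y ~ X @ W + b with the LMMSE estimator; rows are samples."""
    mu_x, mu_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mu_x, Y - mu_y
    n = X.shape[0]
    # Small ridge term (an assumption) keeps the covariance invertible.
    C_xx = Xc.T @ Xc / n + ridge * np.eye(X.shape[1])
    C_xy = Xc.T @ Yc / n
    W = np.linalg.solve(C_xx, C_xy)  # shape (d_in, d_out)
    b = mu_y - mu_x @ W
    return W, b

def cca_score(X, Y, ridge=1e-6):
    """Sum of (1 - rho_i^2) over the canonical correlations of (X, Y)."""
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    n = X.shape[0]
    C_xx = Xc.T @ Xc / n + ridge * np.eye(X.shape[1])
    C_yy = Yc.T @ Yc / n + ridge * np.eye(Y.shape[1])
    C_xy = Xc.T @ Yc / n
    # Whiten both sides with Cholesky factors; the singular values of
    # Lx^{-1} C_xy Ly^{-T} are the canonical correlations.
    Lx = np.linalg.cholesky(C_xx)
    Ly = np.linalg.cholesky(C_yy)
    M = np.linalg.solve(Lx, C_xy)
    M = np.linalg.solve(Ly, M.T).T
    rho = np.clip(np.linalg.svd(M, compute_uv=False), 0.0, 1.0)
    return float(np.sum(1.0 - rho ** 2))

def select_layers(activations, k):
    """Return the k layer names with the lowest score, plus all scores."""
    scores = {name: cca_score(X, Y) for name, (X, Y) in activations.items()}
    return sorted(scores, key=scores.get)[:k], scores

# Synthetic stand-ins for (input, output) activations of two layers.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 8))
A = rng.normal(size=(8, 8))
acts = {
    "near_linear": (X, X @ A + 0.01 * rng.normal(size=(2000, 8))),
    "nonlinear": (X, np.sin(3.0 * X @ A)),
}
chosen, scores = select_layers(acts, k=1)
W, b = fit_lmmse(*acts[chosen[0]])
```

In this toy setup the layer whose output is almost a linear function of its input receives the lower score and is the one selected. In a real model, `X` and `Y` would presumably be attention-block inputs and outputs collected on calibration data, and the fitted `W` and `b` would stand in for the block as a single affine map.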

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis