Efficient Large Language Model Inference with Neural Block Linearization
Mete Erdogan, Francesco Tonin, Volkan Cevher

TL;DR
This paper introduces Neural Block Linearization (NBL), a method that accelerates large language model inference by replacing selected self-attention layers with linear approximations, reporting a 32% inference speed-up with less than 1% accuracy loss.
Contribution
The paper presents Neural Block Linearization, a framework that replaces self-attention layers in pre-trained transformers with linear approximations, without fine-tuning, and uses a theoretical upper bound on the approximation error to choose which layers to replace.
Findings
32% inference speed-up on LLMs
Less than 1% accuracy trade-off
Applicable to pre-trained models without fine-tuning
Abstract
The high inference demands of transformer-based Large Language Models (LLMs) pose substantial challenges in their deployment. To this end, we introduce Neural Block Linearization (NBL), a novel framework for accelerating transformer model inference by replacing self-attention layers with linear approximations derived from Linear Minimum Mean Squared Error estimators. NBL leverages Canonical Correlation Analysis to compute a theoretical upper bound on the approximation error. Then, we use this bound as a criterion for substitution, selecting the LLM layers with the lowest linearization error. NBL can be efficiently applied to pre-trained LLMs without the need for fine-tuning. In experiments, NBL achieves notable computational speed-ups while preserving competitive accuracy on multiple reasoning benchmarks. For instance, applying NBL to 12 self-attention layers in…
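The pipeline described in the abstract (fit a linear estimator per layer, score each layer with a CCA-based quantity, replace the layers with the lowest score) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `fit_lmmse`, `canonical_correlations`, `linearization_score`, and `select_layers` are hypothetical, the inputs are assumed to be calibration activations collected at the input and output of each self-attention layer, and the score `sum(1 - rho_i^2)` over canonical correlations is a stand-in for the paper's bound, whose exact form is not given in this summary.

```python
import numpy as np


def _centered_covariances(X, Y, ridge):
    """Return regularized covariance blocks of X (n, d_in) and Y (n, d_out)."""
    n = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    C_xx = Xc.T @ Xc / n + ridge * np.eye(X.shape[1])
    C_yy = Yc.T @ Yc / n + ridge * np.eye(Y.shape[1])
    C_xy = Xc.T @ Yc / n
    return C_xx, C_yy, C_xy


def fit_lmmse(X, Y, ridge=1e-6):
    """Fit Y ~ X @ W + b with the linear minimum mean squared error estimator.

    W = C_xx^{-1} C_xy and b = mean(Y) - mean(X) @ W (row-vector convention).
    """
    C_xx, _, C_xy = _centered_covariances(X, Y, ridge)
    W = np.linalg.solve(C_xx, C_xy)
    b = Y.mean(axis=0) - X.mean(axis=0) @ W
    return W, b


def canonical_correlations(X, Y, ridge=1e-6):
    """Canonical correlations between X and Y.

    They are the singular values of L_x^{-1} C_xy L_y^{-T}, where
    C_xx = L_x L_x^T and C_yy = L_y L_y^T are Cholesky factorizations.
    """
    C_xx, C_yy, C_xy = _centered_covariances(X, Y, ridge)
    L_x = np.linalg.cholesky(C_xx)
    L_y = np.linalg.cholesky(C_yy)
    M = np.linalg.solve(L_x, C_xy)      # L_x^{-1} C_xy
    M = np.linalg.solve(L_y, M.T).T     # ... times L_y^{-T}
    rho = np.linalg.svd(M, compute_uv=False)
    return np.clip(rho, 0.0, 1.0)


def linearization_score(X, Y, ridge=1e-6):
    """Illustrative CCA-based score (assumption): lower means 'more linear'."""
    rho = canonical_correlations(X, Y, ridge)
    return float(np.sum(1.0 - rho ** 2))


def select_layers(layer_activations, k, ridge=1e-6):
    """Return the k layer names with the lowest linearization score.

    layer_activations maps a layer name to a pair (X, Y) of calibration
    inputs and outputs for that layer.
    """
    scores = {
        name: linearization_score(X, Y, ridge)
        for name, (X, Y) in layer_activations.items()
    }
    return sorted(scores, key=scores.get)[:k]
```

In this sketch, each selected layer would then be replaced at inference time by the affine map `x @ W + b` returned by `fit_lmmse`. The small `ridge` term is an added numerical safeguard for ill-conditioned covariances and is not taken from the paper.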
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
