LAMP: Look-Ahead Mixed-Precision Inference of Large Language Models

Stanislav Budzinskiy; Marian Gloser; Tolunay Yilmaz; Ying Hong Tham; Yuanyi Lin; Wenyi Fang; Fan Wu; Philipp Petersen

arXiv:2601.21623·cs.LG·May 8, 2026

TL;DR

This paper introduces LAMP, an adaptive mixed-precision inference method for large language models that recomputes a small, selected subset of intermediate components at higher accuracy while everything else stays in lower precision. On GPT-2 models, very low recomputation rates improve accuracy by up to two orders of magnitude.

Contribution

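The paper presents an adaptive strategy for mixed-precision computation in transformer inference, derived from a rounding error analysis of compositions $f(g(x))$: only a small subset of the components of $g(x)$ is computed more accurately, and all other computations are carried out with lower accuracy.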

Findings

01

Up to two orders of magnitude accuracy improvement on GPT-2 models.

02

Very low recomputation rates achieve significant precision gains.

03

Adaptive component selection enhances transformer inference efficiency.

Abstract

Mixed-precision computations are a hallmark of the current stage of AI, driving the progress in large language models towards efficient, locally deployable solutions. This article addresses the floating-point computation of compositionally-rich functions, concentrating on transformer inference. Based on the rounding error analysis of a composition $f(g(x))$, we provide an adaptive strategy that selects a small subset of components of $g(x)$ to be computed more accurately while all other computations can be carried out with lower accuracy. We then explain how this strategy can be applied to different compositions within a transformer and illustrate its overall effect on transformer inference. We study the effectiveness of this algorithm numerically on GPT-2 models and demonstrate that already very low recomputation rates allow for improvements of up to two orders of…
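The idea of computing $g(x)$ cheaply and then recomputing only a few of its components more accurately can be illustrated with a toy composition. The sketch below is not the paper's algorithm. It assumes $g(x) = Wx$ and $f = \mathrm{softmax}$, emulates low precision by rounding inputs and outputs to float16, and uses a simple selection rule chosen here for illustration: recompute the `k` components with the largest low-precision values. The motivation for that rule is that the softmax derivative $\partial p_j / \partial y_i = p_i(\delta_{ij} - p_j)$ scales with $p_i$, so errors in large logits matter most. The names `g_low` and `lookahead_softmax` are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax evaluated in float64."""
    z = np.asarray(z, dtype=np.float64)
    e = np.exp(z - z.max())
    return e / e.sum()

def g_low(W, x):
    """Emulated low-precision g(x) = W @ x.

    Inputs and output are rounded to float16; the products are
    accumulated in float32 (an assumption made for this sketch).
    """
    W16 = W.astype(np.float16).astype(np.float32)
    x16 = x.astype(np.float16).astype(np.float32)
    return (W16 @ x16).astype(np.float16).astype(np.float64)

def lookahead_softmax(W, x, k):
    """Return (softmax of low-precision g, softmax after recomputing k components, indices)."""
    y = g_low(W, x)
    n = y.shape[0]
    # Illustrative selection rule (an assumption, not the paper's criterion):
    # pick the k largest low-precision components, which dominate the softmax.
    idx = np.argsort(y)[n - k:]
    y_mixed = y.copy()
    # Recompute only the selected rows of W @ x in float64.
    y_mixed[idx] = W[idx] @ x
    return softmax(y), softmax(y_mixed), idx

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    W = rng.standard_normal((64, 32))
    x = rng.standard_normal(32)
    p_ref = softmax(W @ x)  # full float64 reference
    p_low, p_mixed, idx = lookahead_softmax(W, x, k=4)
    print("max error, low precision only:", np.abs(p_low - p_ref).max())
    print("max error, 4 of 64 recomputed:", np.abs(p_mixed - p_ref).max())
```

Here the recomputation rate is `k / n`. Comparing the two printed errors for different values of `k` gives a rough feel for how much a small number of accurate components can help in this toy setting; it says nothing about the results reported for GPT-2.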
