LAMP: Look-Ahead Mixed-Precision Inference of Large Language Models
Stanislav Budzinskiy, Marian Gloser, Tolunay Yilmaz, Ying Hong Tham, Yuanyi Lin, Wenyi Fang, Fan Wu, Philipp Petersen

TL;DR
This paper introduces LAMP, an adaptive mixed-precision inference method for large language models that recomputes a small subset of critical components at higher accuracy, improving accuracy on GPT-2 models by up to two orders of magnitude at very low recomputation rates.
Contribution
The paper derives, from a rounding error analysis of function compositions, an adaptive strategy that selects which components of a transformer computation to evaluate more accurately while all other computations are carried out with lower accuracy.
Findings
Up to two orders of magnitude accuracy improvement on GPT-2 models.
Very low recomputation rates achieve significant precision gains.
Adaptive component selection improves the accuracy of transformer inference while most computations stay in lower precision.
Abstract
Mixed-precision computations are a hallmark of the current stage of AI, driving the progress in large language models towards efficient, locally deployable solutions. This article addresses the floating-point computation of compositionally-rich functions, concentrating on transformer inference. Based on the rounding error analysis of a composition, we provide an adaptive strategy that selects a small subset of its components to be computed more accurately while all other computations can be carried out with lower accuracy. We then explain how this strategy can be applied to different compositions within a transformer and illustrate its overall effect on transformer inference. We study the effectiveness of this algorithm numerically on GPT-2 models and demonstrate that already very low recomputation rates allow for improvements of up to two orders of…
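The general idea in the abstract (evaluate the inner part of a composition in low precision, then recompute only a few of its components more accurately before applying the outer function) can be illustrated with a small sketch. Everything below is an assumption made for illustration and not the paper's algorithm: the composition is taken to be a softmax over attention-style logits, low precision is simulated by rounding to IEEE half precision, and the selection rule (recompute the largest low-precision logits, since they dominate the softmax output) is a hypothetical stand-in for the paper's look-ahead criterion. The names to_fp16, dot_low, dot_high, and mixed_precision_softmax are likewise hypothetical.

```python
import math
import random
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def dot_low(a, b):
    """Inner product with every product and partial sum rounded to fp16."""
    acc = 0.0
    for u, v in zip(a, b):
        acc = to_fp16(acc + to_fp16(to_fp16(u) * to_fp16(v)))
    return acc


def dot_high(a, b):
    """Inner product in double precision with accurate summation."""
    return math.fsum(u * v for u, v in zip(a, b))


def softmax(z):
    """Numerically stable softmax in double precision."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def mixed_precision_softmax(q, keys, rate):
    """Low-precision logits, with a fraction `rate` recomputed accurately.

    Assumed selection rule (not from the paper): recompute the logits that
    are largest in the low-precision pass.
    """
    logits = [dot_low(q, k) for k in keys]            # inner part, low precision
    n_redo = math.ceil(rate * len(keys))              # recomputation budget
    order = sorted(range(len(keys)), key=lambda i: logits[i], reverse=True)
    chosen = order[:n_redo]
    for i in chosen:
        logits[i] = dot_high(q, keys[i])              # selective recomputation
    return softmax(logits), chosen                    # outer part


if __name__ == "__main__":
    random.seed(0)
    q = [random.uniform(-1, 1) for _ in range(64)]
    keys = [[random.uniform(-1, 1) for _ in range(64)] for _ in range(32)]
    reference = softmax([dot_high(q, k) for k in keys])
    for rate in (0.0, 0.1, 1.0):
        p, _ = mixed_precision_softmax(q, keys, rate)
        err = max(abs(a - b) for a, b in zip(p, reference))
        print(f"rate={rate:.1f}  max abs error={err:.3e}")
```

Running the script prints the maximum absolute error against a double-precision reference for a few recomputation rates. With rate 1.0 every logit is recomputed, so the result coincides with the reference; how much a small rate helps depends on the data and on the selection rule, which is the part the paper's analysis is meant to supply.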
