LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference
Seungjae Moon, Jung-Hoon Kim, Junsoo Kim, Seongmin Hong, Junseo Cha, Minsu Kim, Sukbin Lim, Gyubin Choi, Dongjin Seo, Jongho Kim, Hunjong Lee, Hyunjun Park, Ryeowook Ko, Soongyu Choi, Jongse Park, Jinwon Lee, Joo-Young Kim

TL;DR
This paper presents LPU, a latency-optimized, scalable processor architecture designed to accelerate large language model inference, achieving significant speed and energy efficiency improvements over GPUs.
Contribution
The paper introduces LPU, a novel processor architecture with a streamlined dataflow and expandable synchronization, optimized for large language model inference.
Findings
LPU achieves a latency of 1.25 ms/token on a 1.3B-parameter model.
At that latency, LPU is 2.09x faster than the GPU on the same 1.3B model.
LPU-based servers are more energy-efficient than NVIDIA H100 and L4 servers.
Abstract
The explosive arrival of OpenAI's ChatGPT has fueled the globalization of large language models (LLMs), which consist of billions of pretrained parameters that embody aspects of syntax and semantics. HyperAccel introduces the latency processing unit (LPU), a latency-optimized and highly scalable processor architecture for the acceleration of LLM inference. LPU perfectly balances the memory bandwidth and compute logic with a streamlined dataflow to maximize performance and efficiency. LPU is equipped with an expandable synchronization link (ESL) that hides data synchronization latency between multiple LPUs. HyperDex complements LPU as an intuitive software framework to run LLM applications. LPU achieves 1.25 ms/token and 20.9 ms/token for the 1.3B and 66B models, respectively, which is 2.09x and 1.37x faster than the GPU. LPU, synthesized using Samsung's 4nm process, has a total area of 0.824 mm² and…
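To put the latency figures in the abstract on a more familiar scale, the sketch below converts them to tokens per second and back-computes the GPU latency implied by each reported speedup. It rests on two assumptions that the abstract does not state: that "N x faster" means the GPU's per-token latency is N times the LPU's, and that throughput for a single generation stream is the reciprocal of per-token latency. The function and variable names are illustrative only.

```python
# Reported in the abstract: (LPU latency in ms/token, speedup over the GPU).
REPORTED = {
    "1.3B": (1.25, 2.09),
    "66B": (20.9, 1.37),
}


def tokens_per_second(ms_per_token: float) -> float:
    """Single-stream throughput, assuming throughput = 1 / latency."""
    return 1000.0 / ms_per_token


def implied_gpu_latency(lpu_ms_per_token: float, speedup: float) -> float:
    """GPU latency implied by the speedup, assuming speedup = GPU / LPU latency."""
    return lpu_ms_per_token * speedup


summary = {}
for model, (lpu_ms, speedup) in REPORTED.items():
    summary[model] = {
        "lpu_tokens_per_s": tokens_per_second(lpu_ms),
        "implied_gpu_ms_per_token": implied_gpu_latency(lpu_ms, speedup),
    }
    print(
        f"{model}: LPU {summary[model]['lpu_tokens_per_s']:.1f} tokens/s, "
        f"implied GPU latency {summary[model]['implied_gpu_ms_per_token']:.2f} ms/token"
    )
```

Under these assumptions the 1.3B figure corresponds to about 800 tokens/s on LPU and an implied GPU latency of roughly 2.61 ms/token; the 66B figure corresponds to about 47.8 tokens/s and roughly 28.6 ms/token. These are derived values, not measurements from the paper.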
