LPU: A Latency-Optimized and Highly Scalable Processor for Large Language Model Inference

Seungjae Moon, Jung-Hoon Kim, Junsoo Kim, Seongmin Hong, Junseo Cha, Minsu Kim, Sukbin Lim, Gyubin Choi, Dongjin Seo, Jongho Kim, Hunjong Lee, Hyunjun Park, Ryeowook Ko, Soongyu Choi, Jongse Park, Jinwon Lee, Joo-Young Kim

arXiv:2408.07326·cs.AR·August 15, 2024

TL;DR

This paper presents LPU, a latency-optimized, scalable processor architecture designed to accelerate large language model inference, achieving significant speed and energy efficiency improvements over GPUs.

Contribution

The paper introduces LPU, a processor architecture optimized for large language model inference that combines a streamlined dataflow with an expandable synchronization link (ESL) for scaling across multiple LPUs.

Findings

01 LPU achieves 1.25 ms/token for a 1.3B model and 20.9 ms/token for a 66B model.

02 LPU is 2.09x faster than the GPU for the 1.3B model and 1.37x faster for the 66B model.

03 LPU-based servers are more energy-efficient than NVIDIA H100 and L4 servers.

Abstract

The explosive arrival of OpenAI's ChatGPT has fueled the globalization of large language models (LLMs), which consist of billions of pretrained parameters that embody aspects of syntax and semantics. HyperAccel introduces the latency processing unit (LPU), a latency-optimized and highly scalable processor architecture for the acceleration of LLM inference. LPU balances memory bandwidth and compute logic with a streamlined dataflow to maximize performance and efficiency. LPU is equipped with an expandable synchronization link (ESL) that hides data synchronization latency between multiple LPUs. HyperDex complements LPU as an intuitive software framework for running LLM applications. LPU achieves 1.25 ms/token and 20.9 ms/token for 1.3B and 66B models, respectively, which is 2.09x and 1.37x faster than the GPU. LPU, synthesized using a Samsung 4nm process, has a total area of 0.824 mm² and…
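To put the latency figures in more familiar terms, the short Python sketch below converts the quoted ms/token values into tokens per second and backs out the GPU latency implied by each speedup. It assumes that "Nx faster" means the ratio of GPU to LPU per-token latency. The helper names are hypothetical, and the GPU latencies it prints are derived from the abstract's numbers, not reported in it.

```python
# Figures quoted in the abstract: model -> (LPU latency in ms/token, stated speedup over the GPU).
REPORTED = {
    "1.3B": (1.25, 2.09),
    "66B": (20.9, 1.37),
}


def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds into a generation rate."""
    return 1000.0 / ms_per_token


def implied_gpu_latency_ms(lpu_ms_per_token: float, speedup: float) -> float:
    """Back out the GPU latency implied by the speedup.

    Assumption (not stated in the abstract): speedup = GPU latency / LPU latency.
    """
    return lpu_ms_per_token * speedup


if __name__ == "__main__":
    for model, (lpu_ms, speedup) in REPORTED.items():
        gpu_ms = implied_gpu_latency_ms(lpu_ms, speedup)
        print(
            f"{model}: LPU {lpu_ms} ms/token ({tokens_per_second(lpu_ms):.1f} tokens/s), "
            f"implied GPU {gpu_ms:.2f} ms/token ({tokens_per_second(gpu_ms):.1f} tokens/s)"
        )
```

Under that reading, 1.25 ms/token corresponds to 800 tokens per second for the 1.3B model, and the implied GPU latency is roughly 2.61 ms/token.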
