Distributed Inference Performance Optimization for LLMs on CPUs

Pujiang He; Shan Zhou; Changqing Li; Wenhuan Huang; Weifei Yu; Duyi Wang; Chen Meng; Sheng Gui

arXiv:2407.00029 · cs.DC · July 2, 2024

TL;DR

This paper presents a distributed inference optimization solution for large language models on CPUs. On 5th Gen Intel Xeon Scalable Processors, it reaches 140 ms per output token for a 72B-parameter model, easing deployment on memory-limited hardware.

Contribution

The paper proposes a distributed inference optimization solution designed for LLMs on CPUs, aimed at easing single-node memory constraints and speeding up inference.

Findings

01. The time per output token for a 72B-parameter LLM is 140 ms/token.

02. Inference is faster than the average human reading speed of about 200 ms per token.

03. The results were obtained on 5th Gen Intel Xeon Scalable Processors.
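
To put the per-token figures in more familiar units, the short sketch below converts latency per token into tokens per second. It uses only the two numbers reported for this paper (140 ms/token for the model and about 200 ms/token for human reading). The function name `tokens_per_second` is a hypothetical helper for illustration, not something taken from the paper or its repository.

```python
def tokens_per_second(ms_per_token: float) -> float:
    """Convert a per-token latency in milliseconds to a throughput in tokens/s."""
    if ms_per_token <= 0:
        raise ValueError("ms_per_token must be positive")
    return 1000.0 / ms_per_token


# Figures reported for the paper.
MODEL_MS_PER_TOKEN = 140.0    # 72B-parameter LLM, time per output token
READING_MS_PER_TOKEN = 200.0  # approximate average human reading speed

model_tps = tokens_per_second(MODEL_MS_PER_TOKEN)
reading_tps = tokens_per_second(READING_MS_PER_TOKEN)

# Ratio of reading latency to generation latency: values above 1 mean
# the model produces tokens faster than a person reads them.
speed_ratio = READING_MS_PER_TOKEN / MODEL_MS_PER_TOKEN

print(f"model:   {model_tps:.2f} tokens/s")
print(f"reading: {reading_tps:.2f} tokens/s")
print(f"ratio:   {speed_ratio:.2f}x")
```

Under these figures the model generates roughly 7.1 tokens per second against about 5 tokens per second for reading, a ratio of about 1.4.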

Abstract

Large language models (LLMs) hold tremendous potential for addressing numerous real-world challenges, yet they typically demand significant computational resources and memory. Deploying LLMs on resource-limited hardware with restricted memory capacity presents considerable challenges. Distributed computing is a prevalent strategy for mitigating single-node memory constraints and accelerating LLM inference. To ease these hardware limitations, we propose an efficient distributed inference optimization solution for LLMs on CPUs. We conduct experiments with the proposed solution on 5th Gen Intel Xeon Scalable Processors, and the results show that the time per output token for an LLM with 72B parameters is 140 ms/token, much faster than the average human reading speed of about 200 ms per token.
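
The single-node memory constraint mentioned in the abstract can be illustrated with a rough estimate of weight storage. The sketch below is a back-of-the-envelope calculation, not the paper's method. It assumes weights are split evenly across nodes and ignores the KV cache, activations, and any replicated tensors. The precisions (2 bytes for bf16, 1 byte for int8) and the node counts are illustrative assumptions, and `weight_footprint_gb` is a hypothetical helper.

```python
def weight_footprint_gb(n_params: float, bytes_per_param: float, n_nodes: int = 1) -> float:
    """Approximate per-node weight memory in GB (1 GB = 1e9 bytes).

    Assumes the weights are sharded evenly across n_nodes and that
    nothing is replicated; KV cache and activations are not counted.
    """
    if n_nodes < 1:
        raise ValueError("n_nodes must be at least 1")
    return n_params * bytes_per_param / n_nodes / 1e9


PARAMS_72B = 72e9  # parameter count of the model size cited in the abstract

# Illustrative precisions and node counts (assumptions, not from the paper).
for label, bytes_per_param in (("bf16", 2), ("int8", 1)):
    for n_nodes in (1, 2, 4):
        gb = weight_footprint_gb(PARAMS_72B, bytes_per_param, n_nodes)
        print(f"{label}, {n_nodes} node(s): {gb:.0f} GB of weights per node")
```

Under these assumptions, the weights of a 72B-parameter model in bf16 occupy about 144 GB on one node and about 36 GB per node when split four ways, which is why sharding across nodes relieves per-node memory pressure.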

Code & Models

Repositories

intel/xfastertransformer
pytorch

Taxonomy

Topics: Distributed and Parallel Computing Systems