Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective

Jinhao Li; Jiaming Xu; Shan Huang; Yonghua Chen; Wen Li; Jun Liu; Yaoxiu Lian; Jiayi Pan; Li Ding; Hao Zhou; Yu Wang; Guohao Dai

arXiv:2410.04466 · cs.AR · June 16, 2025 · 3 cites

TL;DR

This paper provides a comprehensive survey of hardware-based optimization techniques for accelerating generative large language model inference across various platforms, analyzing performance, energy efficiency, and future trends.

Contribution

It systematically compares inference acceleration methods across hardware platforms and highlights emerging trends like multimodality and energy-efficient inference.

Findings

01. Hardware-specific optimization methods improve inference speed.

02. Energy efficiency varies significantly across platforms.

03. Multimodality and inference-time compute are promising future directions.

Abstract

Large Language Models (LLMs) have demonstrated remarkable capabilities across various fields, from natural language understanding to text generation. Compared to non-generative LLMs such as BERT and DeBERTa, generative LLMs such as the GPT and Llama series are currently the main focus due to their superior algorithmic performance. The advancements in generative LLMs are closely intertwined with the development of hardware capabilities. Different hardware platforms exhibit distinct characteristics, which can be exploited to improve LLM inference performance. Therefore, this paper comprehensively surveys efficient generative LLM inference on different hardware platforms. First, we provide an overview of the algorithm architecture of mainstream generative LLMs and delve into the inference process. Then, we summarize different optimization methods for different platforms such as CPU, GPU, FPGA,…

Code & Models

Repositories

kimho666/llm_hardware_survey (official)

Taxonomy

Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Data Processing Techniques

Methods: Attention Is All You Need · WordPiece · Linear Layer · Residual Connection · Weight Decay · Cosine Annealing · Linear Warmup With Linear Decay · Dropout