Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective
Jinhao Li, Jiaming Xu, Shan Huang, Yonghua Chen, Wen Li, Jun Liu, Yaoxiu Lian, Jiayi Pan, Li Ding, Hao Zhou, Yu Wang, Guohao Dai

TL;DR
This paper provides a comprehensive survey of hardware-based optimization techniques for accelerating generative large language model inference across various platforms, analyzing performance, energy efficiency, and future trends.
Contribution
It systematically compares inference acceleration methods across hardware platforms and highlights emerging trends like multimodality and energy-efficient inference.
Findings
Hardware-specific optimization methods improve inference speed.
Energy efficiency varies significantly across platforms.
Multimodality and inference-time compute are promising future directions.
Abstract
Large Language Models (LLMs) have demonstrated remarkable capabilities across various fields, from natural language understanding to text generation. Compared to non-generative LLMs such as BERT and DeBERTa, generative LLMs such as the GPT series and the Llama series are currently the main focus due to their superior algorithmic performance. The advancements in generative LLMs are closely intertwined with the development of hardware capabilities. Different hardware platforms exhibit distinct characteristics, which can be exploited to improve LLM inference performance. Therefore, this paper comprehensively surveys efficient generative LLM inference on different hardware platforms. First, we provide an overview of the algorithm architecture of mainstream generative LLMs and delve into the inference process. Then, we summarize different optimization methods for different platforms such as CPU, GPU, FPGA,…
Taxonomy
Topics: Topic Modeling · Natural Language Processing Techniques · Advanced Data Processing Techniques
Methods: Attention Is All You Need · WordPiece · Linear Layer · Residual Connection · Weight Decay · Cosine Annealing · Linear Warmup With Linear Decay · Dropout
