Challenges and Research Directions for Large Language Model Inference Hardware
Xiaoyu Ma, David Patterson

TL;DR
This paper explains why large language model inference is hard for hardware, arguing that memory and interconnect, rather than compute, are the main limits. It highlights four architecture research opportunities aimed at datacenter AI and reviews their applicability to mobile devices.
Contribution
It identifies memory and interconnect as the key hardware bottlenecks in LLM inference and proposes architectural directions to address them: High Bandwidth Flash, Processing-Near-Memory, 3D memory-logic stacking, and low-latency interconnect.
Findings
Memory and interconnect are primary bottlenecks in LLM inference.
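The proposed directions target memory capacity (High Bandwidth Flash), memory bandwidth (Processing-Near-Memory and 3D memory-logic stacking), and communication latency (low-latency interconnect).
The focus is datacenter AI; the paper also reviews how the proposals apply to mobile devices.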
Abstract
Large Language Model (LLM) inference is hard. The autoregressive Decode phase of the underlying Transformer model makes LLM inference fundamentally different from training. Exacerbated by recent AI trends, the primary challenges are memory and interconnect rather than compute. To address these challenges, we highlight four architecture research opportunities: High Bandwidth Flash for 10X memory capacity with HBM-like bandwidth; Processing-Near-Memory and 3D memory-logic stacking for high memory bandwidth; and low-latency interconnect to speed up communication. While our focus is datacenter AI, we also review their applicability for mobile devices.
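The claim that Decode is limited by memory rather than compute can be illustrated with a rough roofline-style estimate. The sketch below is not taken from the paper: it assumes a single dense d_model x d_model layer with 2-byte values, ignores attention and the KV cache, and uses hypothetical accelerator numbers (PEAK_FLOPS, MEM_BW) chosen only for illustration.

```python
def arithmetic_intensity(tokens: int, d_model: int, bytes_per_value: int = 2) -> float:
    """FLOPs per byte moved for one (tokens x d_model) @ (d_model x d_model) matmul."""
    flops = 2 * tokens * d_model * d_model  # one multiply and one add per weight per token
    weight_bytes = bytes_per_value * d_model * d_model
    activation_bytes = bytes_per_value * 2 * tokens * d_model  # read inputs, write outputs
    return flops / (weight_bytes + activation_bytes)


def is_memory_bound(intensity: float, peak_flops: float, mem_bandwidth: float) -> bool:
    """A kernel is memory-bound when its intensity is below the roofline ridge point."""
    return intensity < peak_flops / mem_bandwidth


# Hypothetical accelerator and model sizes, assumed for illustration only.
PEAK_FLOPS = 1e15  # FLOP/s
MEM_BW = 3e12      # bytes/s
D_MODEL = 8192

decode = arithmetic_intensity(tokens=1, d_model=D_MODEL)      # one new token per step
prefill = arithmetic_intensity(tokens=4096, d_model=D_MODEL)  # whole prompt at once

print(f"decode:  {decode:.2f} FLOP/byte, memory-bound: {is_memory_bound(decode, PEAK_FLOPS, MEM_BW)}")
print(f"prefill: {prefill:.2f} FLOP/byte, memory-bound: {is_memory_bound(prefill, PEAK_FLOPS, MEM_BW)}")
```

Under these assumptions, a Decode step performs about one FLOP per byte of weights read, far below the ridge point of the assumed accelerator, while processing a 4096-token prompt in one pass reuses each weight thousands of times. Batching multiple requests raises Decode intensity, so this single-request figure is a lower bound.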
Taxonomy
Topics: Advanced Neural Network Applications · Machine Learning in Materials Science · Big Data and Digital Economy
