TL;DR
This paper introduces CENT, a CXL-enabled, GPU-free system that significantly improves the efficiency and cost-effectiveness of large language model inference by leveraging memory expansion and high-bandwidth memory access.
Contribution
CENT is the first system to utilize CXL memory expansion and peer-to-peer communication for efficient GPU-free LLM inference, outperforming GPU-based systems in throughput and energy efficiency.
Findings
2.3× higher throughput than GPU baselines
2.9× less energy consumption
5.2× more tokens per dollar
Abstract
Large Language Model (LLM) inference generates one token at a time autoregressively, which gives it notably lower operational intensity than earlier Machine Learning (ML) models such as encoder-only transformers and Convolutional Neural Networks. At the same time, LLMs possess large parameter sizes and use key-value caches to store context information. Modern LLMs support context windows with up to 1 million tokens to generate versatile text, audio, and video content. A large key-value cache unique to each prompt requires a large memory capacity, limiting the inference batch size. Both low operational intensity and limited batch size necessitate a high memory bandwidth. However, contemporary hardware systems for ML model deployment, such as GPUs and TPUs, are primarily optimized for compute throughput. This mismatch challenges the efficient deployment of advanced…
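To make the abstract's argument concrete, the sketch below estimates how a per-prompt key-value cache limits batch size and how low the operational intensity of decoding is. It is a back-of-the-envelope illustration, not the paper's model: the function names, the model shape (80 layers, 8 key-value heads, head dimension 128, 16-bit values), and the memory figures are all assumptions chosen for the example, and the intensity estimate counts only weight traffic.

```python
# Illustrative estimates only. All model and memory figures are hypothetical
# and are not taken from the CENT paper.

GIB = 2**30


def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of key-value cache stored per token of context.

    Each layer stores one key vector and one value vector per KV head,
    hence the leading factor of 2.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem


def max_batch_size(memory_bytes, weight_bytes, context_len, per_token_bytes):
    """Largest batch whose KV caches fit in the memory left after the weights."""
    free_bytes = memory_bytes - weight_bytes
    if free_bytes <= 0:
        return 0
    return free_bytes // (context_len * per_token_bytes)


def decode_operational_intensity(batch_size, bytes_per_param=2):
    """Approximate FLOPs per byte for one decode step.

    Assumes about 2 FLOPs per parameter per sequence and that the weights
    are read once per step; KV-cache and activation traffic are ignored.
    """
    return 2 * batch_size / bytes_per_param


if __name__ == "__main__":
    per_token = kv_cache_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128)
    batch = max_batch_size(
        memory_bytes=512 * GIB,   # hypothetical total memory capacity
        weight_bytes=140 * GIB,   # hypothetical weight footprint
        context_len=4096,
        per_token_bytes=per_token,
    )
    print(f"KV cache per token: {per_token} bytes")
    print(f"Max batch size at 4096-token context: {batch}")
    print(f"Decode intensity at batch 1: {decode_operational_intensity(1)} FLOPs/byte")
```

Under these assumptions, a longer context shrinks the feasible batch size proportionally, while a small batch keeps the operational intensity low, which is why the abstract stresses memory bandwidth and capacity rather than compute throughput.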
