Technology solutions targeting the performance of gen-AI inference in resource constrained platforms
Joyjit Kundu, Joshua Klein, Aakash Patel, Dwaipayan Biswas

TL;DR
This paper evaluates two emerging memory technologies, High Bandwidth Storage and bonded global buffer memory, for improving generative AI inference performance on resource-constrained devices.
Contribution
It provides a hierarchical roofline-based analysis of memory solutions for large and small models, addressing capacity and bandwidth challenges.
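As a rough illustration of what a hierarchical roofline bound looks like, the sketch below takes the attainable throughput as the minimum of the compute peak and, for each memory level, that level's bandwidth multiplied by the arithmetic intensity seen at that level. This is a generic textbook form, not the paper's model: the function name, the memory level names, and every number in the example are assumptions chosen for illustration.

```python
def hierarchical_roofline(peak_flops, flops, traffic_bytes, bandwidths):
    """Return (attainable FLOP/s, name of the limiting resource).

    peak_flops    : compute peak in FLOP/s
    flops         : total floating-point operations of the workload
    traffic_bytes : dict mapping memory level -> bytes moved at that level
    bandwidths    : dict mapping memory level -> bandwidth in bytes/s
    """
    bounds = {"compute": peak_flops}
    for level, bandwidth in bandwidths.items():
        moved = traffic_bytes.get(level, 0)
        if moved > 0:
            # Arithmetic intensity at this level = flops / bytes moved here.
            bounds[level] = bandwidth * flops / moved
    limiter = min(bounds, key=bounds.get)
    return bounds[limiter], limiter


# Hypothetical device and workload (all values assumed, not from the paper).
perf, limiter = hierarchical_roofline(
    peak_flops=10e12,
    flops=2e9,
    traffic_bytes={"dram": 1e9, "global_buffer": 4e9},
    bandwidths={"dram": 50e9, "global_buffer": 1e12},
)
print(f"attainable: {perf / 1e9:.0f} GFLOP/s, limited by {limiter}")
```

In this made-up case the off-chip level is the binding constraint, which is the kind of situation where a higher-bandwidth memory level would raise the bound.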
Findings
High Bandwidth Storage improves throughput for large models.
Bonded global buffer memory enhances small model performance.
Bandwidth and latency requirements are critical for interactivity.
Abstract
The rise of generative AI workloads, particularly language model inference, is intensifying on/off-chip memory pressure. Multimodal inputs such as video streams or images and downstream applications like Question Answering (QA) and analysis over large documents incur long context lengths, requiring caching of massive Key and Value states of the previous tokens. Even a low degree of concurrent inference serving on resource-constrained devices, like mobiles, can further add to memory capacity pressure and runtime memory management complexity. In this paper, we evaluate the performance implications of two emerging technology solutions to alleviate the memory pressure in terms of both capacity and bandwidth using a hierarchical roofline-based analytical performance model. For large models (e.g., 13B parameters) and context lengths, we investigate the performance implications of High…
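To make the Key and Value caching pressure described in the abstract concrete, here is a minimal sketch of the standard KV cache size estimate for a decoder-only transformer: two tensors (K and V) per layer, each holding one vector per KV head per cached token. The model shape used in the example (40 layers, 40 KV heads, head dimension 128, 16-bit values) is an assumption resembling a typical 13B-parameter model and is not taken from the paper; the function name is hypothetical.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, context_len,
                   bytes_per_value=2, batch_size=1):
    """Estimate KV cache size in bytes for a decoder-only transformer."""
    # Factor 2 accounts for storing both Keys and Values.
    return (2 * n_layers * n_kv_heads * head_dim
            * context_len * bytes_per_value * batch_size)


# Assumed 13B-class shape; values are illustrative only.
size = kv_cache_bytes(n_layers=40, n_kv_heads=40, head_dim=128,
                      context_len=4096)
print(f"{size / 2**30:.3f} GiB per sequence")

# Serving a few concurrent requests scales the footprint linearly.
size_4 = kv_cache_bytes(40, 40, 128, 4096, batch_size=4)
print(f"{size_4 / 2**30:.3f} GiB for 4 concurrent sequences")
```

Under these assumptions the cache grows linearly with both context length and the number of concurrent sequences, which is why long contexts and even modest concurrency strain memory capacity on small devices.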
