Technology solutions targeting the performance of gen-AI inference in resource constrained platforms

Joyjit Kundu; Joshua Klein; Aakash Patel; Dwaipayan Biswas

arXiv:2604.11128·cs.AR·April 14, 2026

TL;DR

This paper evaluates two emerging memory technologies, High Bandwidth Storage and bonded global buffer memory, for improving generative AI inference performance on resource-constrained devices.

Contribution

The paper provides a hierarchical roofline-based analysis of these memory solutions for both large and small models, addressing memory capacity and bandwidth challenges.
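As a rough illustration of what a hierarchical roofline estimate looks like, the sketch below bounds a kernel's execution time by the slowest of its compute time and the transfer time at each memory level. This is not the paper's model: the `MemoryLevel` type, the function name, and every hardware number are hypothetical values chosen only to make the arithmetic concrete.

```python
from dataclasses import dataclass


@dataclass
class MemoryLevel:
    name: str
    bandwidth_bytes_per_s: float  # sustained bandwidth of this level
    bytes_moved: float            # traffic this kernel generates at this level


def hierarchical_roofline_time(flops, peak_flops_per_s, levels):
    """Return (time_s, bottleneck): the max over compute time and per-level transfer time."""
    times = {"compute": flops / peak_flops_per_s}
    for level in levels:
        times[level.name] = level.bytes_moved / level.bandwidth_bytes_per_s
    bottleneck = max(times, key=times.get)
    return times[bottleneck], bottleneck


# Hypothetical single-token decode step for a 13B-parameter model stored at
# 1 byte per parameter: every weight is read once, about 2 FLOPs per weight.
weight_bytes = 13e9
decode_flops = 2 * 13e9

levels = [
    MemoryLevel("dram", bandwidth_bytes_per_s=50e9, bytes_moved=weight_bytes),
    MemoryLevel("on_chip_buffer", bandwidth_bytes_per_s=1000e9, bytes_moved=weight_bytes),
]

step_time, bottleneck = hierarchical_roofline_time(decode_flops, 10e12, levels)
print(f"{step_time:.3f} s per token, limited by {bottleneck}")
```

With these assumed numbers the decode step is bound by the slowest memory level rather than by compute, which is the kind of regime where extra memory bandwidth would be expected to help.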

Findings

01. High Bandwidth Storage improves throughput for large models.

02. Bonded global buffer memory enhances small model performance.

03. Bandwidth and latency requirements are critical for interactivity.

Abstract

The rise of generative AI workloads, particularly language model inference, is intensifying on-chip and off-chip memory pressure. Multimodal inputs such as video streams or images, and downstream applications like Question Answering (QA) and analysis over large documents, incur long context lengths, requiring caching of massive Key and Value states of the previous tokens. Even a low degree of concurrent inference serving on resource-constrained devices, such as mobile phones, can further add to memory capacity pressure and runtime memory management complexity. In this paper, we evaluate the performance implications of two emerging technology solutions to alleviate the memory pressure in terms of both capacity and bandwidth using a hierarchical roofline-based analytical performance model. For large models (e.g., 13B parameters) and context lengths, we investigate the performance implications of High…
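To make the capacity pressure from cached Key and Value states concrete, the sketch below applies the standard KV-cache size formula for a decoder-only transformer. The model shape (40 layers, 40 KV heads, head dimension 128), the 32K-token context, and the 16-bit element size are assumptions for illustration and are not taken from the paper.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, context_len,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to cache K and V for every layer, head, and token."""
    # The leading 2 accounts for storing both the Key and the Value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * context_len * bytes_per_elem * batch)


# Hypothetical shape for a model in the 13B-parameter class.
size = kv_cache_bytes(num_layers=40, num_kv_heads=40, head_dim=128,
                      context_len=32 * 1024)
print(f"KV cache: {size / 2**30:.1f} GiB for one sequence")

# Two concurrent sequences double the footprint.
size_two = kv_cache_bytes(40, 40, 128, 32 * 1024, batch=2)
```

Under these assumptions a single long-context sequence already needs about 25 GiB of cache, and the footprint grows linearly with the number of concurrent sequences.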