A-IO: Adaptive Inference Orchestration for Memory-Bound NPUs

Chen Zhang; Yan Ding; Haotian Wang; Chubo Liu; Keqin Li; Kenli Li

arXiv:2604.09752 · cs.DC · April 16, 2026

TL;DR

This paper introduces A-IO, an adaptive inference orchestration approach that targets the memory-bound autoregressive decoding phase of LLM deployment on heterogeneous NPUs, with the aim of improving efficiency and scalability.

Contribution

The paper proposes a dynamic inference orchestration framework that addresses the limitations of static deployment and the kernel synchronization overhead in NPU-based LLM inference.

Findings

01. A-IO reduces inference latency on heterogeneous NPUs.

02. It mitigates the Model Scaling Paradox in LLM deployment.

03. The approach improves resource utilization during autoregressive decoding.

Abstract

During the deployment of Large Language Models (LLMs), the autoregressive decoding phase on heterogeneous NPU platforms (e.g., Ascend 910B) faces severe memory-bound challenges. This study reveals the "Model Scaling Paradox" caused by the static deployment of single-sized models. It also points out the kernel synchronization overhead of fine-grained speculative decoding (Leviathan et al., 2023; Chen et al., 2023) under NPU computational graph compilation, and the severe limitations of purely relying on micro-level acceleration algorithms like Prompt LookUp Decoding (PLD)
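For readers unfamiliar with Prompt LookUp Decoding, the drafting step can be illustrated with a minimal sketch. This is a generic illustration of the n-gram lookup idea, not the paper's implementation; the function name `prompt_lookup_draft` and the parameters `ngram_size` and `num_draft` are hypothetical, and the verification of drafted tokens by the target model is omitted.

```python
from typing import List


def prompt_lookup_draft(
    tokens: List[int], ngram_size: int = 3, num_draft: int = 5
) -> List[int]:
    """Propose draft tokens by finding an earlier copy of the trailing n-gram.

    Hypothetical helper for illustration: `tokens` is the prompt plus the
    tokens generated so far. If the last `ngram_size` tokens appeared earlier
    in the sequence, the tokens that followed that earlier occurrence are
    returned as a draft (at most `num_draft` of them). No draft model is used.
    """
    if len(tokens) < ngram_size + 1:
        return []
    tail = tokens[-ngram_size:]
    # Scan backwards so the most recent earlier occurrence wins; the range
    # upper bound excludes the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            end = start + ngram_size
            return tokens[end:end + num_draft]
    return []


if __name__ == "__main__":
    history = [1, 2, 3, 4, 5, 6, 1, 2, 3]
    print(prompt_lookup_draft(history, ngram_size=3, num_draft=3))
```

In a full decoding loop the target model would score the drafted tokens in a single forward pass and keep only the prefix it agrees with. When the lookup finds no match, the draft is empty and decoding falls back to one token per step, which is one reason a purely lookup-based method can offer limited gains on inputs with little repetition.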
