{\mu}-ORCA: Optimizing Acceleration for Microsecond-Scale Deep Neural Network Inference on ACAP

Shixin Ji; Jinming Zhuang; Zhuoping Yang; Xingzhen Chen; Wei Zhang; Peipei Zhou

arXiv:2605.17683·cs.AR·May 19, 2026

TL;DR

{\mu}-ORCA is a customized accelerator framework for ultra-low-latency DNN inference on AMD ACAP platforms. It enables direct inter-layer communication on the AIE array so that inference can meet microsecond-scale latency targets.

Contribution

The paper introduces {\mu}-ORCA, a framework that optimizes inter-layer communication and hardware utilization for microsecond-scale DNN inference on heterogeneous reconfigurable platforms.

Findings

01

Achieves latency reductions of more than 1.70x and 1.83x over state-of-the-art frameworks.

02

Attains a latency of 0.93 microseconds on a 6-layer DeepSets model.

03

Supports MLP and DeepSets models with non-MM kernels on AMD ACAP.

Abstract

Heterogeneous reconfigurable platforms with tensor cores, such as AMD ACAP, are increasingly adopted for deep neural network (DNN) inference due to their high throughput and flexibility. However, their suitability for microsecond-scale inference on small problem sizes remains underexplored. In jet-tagging applications in high-energy physics, inefficient on-chip communication and large inter-layer latency prevent existing frameworks from meeting the 1-{\mu}s latency budget. Moreover, hardware overheads such as synchronization and VLIW processor prologue are often overlooked, making it infeasible to optimize accelerators correctly. To address these problems, we propose {\mu}-ORCA, a customized heterogeneous accelerator framework for ultra-low-latency model inference. {\mu}-ORCA enables direct inter-layer communication between DNN layers on the AIE array, instead of using shared memory…

Code & Models

Repositories

arc-research-lab/u-ORCA
github
