TL;DR
{}-ORCA is a specialized accelerator framework designed for ultra-low-latency DNN inference on AMD ACAP platforms, enabling direct layer communication and optimized design to meet microsecond latency targets.
Contribution
The paper introduces {}-ORCA, a novel framework that optimizes inter-layer communication and hardware utilization for microsecond-scale DNN inference on reconfigurable platforms.
Findings
Achieves over 1.70x and 1.83x latency reduction compared to state-of-the-art frameworks.
Attains 0.93 microseconds latency on a 6-layer DeepSets model.
Supports MLP and DeepSets models with non-MM kernels on AMD ACAP.
Abstract
Heterogeneous reconfigurable platforms with tensor cores, such as AMD ACAP, are increasingly adopted for deep neural network (DNN) inference due to their high throughput and flexibility. However, their suitability for microsecond-scale inference on small problem sizes remains underexplored. In jet-tagging applications in high-energy physics, inefficient on-chip communication and large inter-layer latency prevent existing frameworks from meeting the 1-{\mu}s latency budget. Moreover, hardware overheads such as synchronization and VLIW processor prologue are often overlooked, making it infeasible to optimize accelerators correctly. To address these problems, we propose {\mu}-ORCA, a customized heterogeneous accelerator framework for ultra-low-latency model inference. {\mu}-ORCA enables direct inter-layer communication between DNN layers on the AIE array, instead of using shared memory…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
