TL;DR
ANNETTE is a stacked-model framework that estimates DNN inference latency on hardware accelerators, so that architecture search can be decoupled from the target hardware.
Contribution
A stacked modeling approach that combines mapping models with layer-wise estimation models, both extracted from micro-kernel and multi-layer benchmarks, to estimate network execution time on hardware accelerators.
Findings
Average estimation error of 3.47% on DNNDK
Fidelity of 0.988 on NASBench networks
Outperforms existing statistical and analytical models
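The two headline metrics above can be illustrated with a small sketch. It assumes that "average estimation error" means the mean absolute percentage error between measured and estimated latencies, and that "fidelity" means a rank correlation (Spearman's rho is used here); neither definition is stated in this summary. The function names and the latency values are hypothetical and are not results from the paper.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman_rho(x, y):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def mean_abs_pct_error(measured, estimated):
    """Mean absolute percentage error, in percent (assumed error metric)."""
    total = sum(abs(e - m) / m for m, e in zip(measured, estimated))
    return 100.0 * total / len(measured)


# Hypothetical latencies in milliseconds for four networks.
measured_ms = [10.0, 20.0, 30.0, 40.0]
estimated_ms = [11.0, 19.0, 33.0, 50.0]

error_pct = mean_abs_pct_error(measured_ms, estimated_ms)
fidelity = spearman_rho(measured_ms, estimated_ms)
```

A high rank correlation means the estimator orders candidate networks the same way the hardware does, which is what matters for architecture search even when the absolute error is not zero.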
Abstract
With new accelerator hardware for DNN, the computing power for AI applications has increased rapidly. However, as DNN algorithms become more complex and optimized for specific applications, latency requirements remain challenging, and it is critical to find the optimal points in the design space. To decouple the architectural search from the target hardware, we propose a time estimation framework that allows for modeling the inference latency of DNNs on hardware accelerators based on mapping and layer-wise estimation models. The proposed methodology extracts a set of models from micro-kernel and multi-layer benchmarks and generates a stacked model for mapping and network execution time estimation. We compare estimation accuracy and fidelity of the generated mixed models, statistical models with the roofline model, and a refined roofline model for evaluation. We test the mixed models on…
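As a point of reference for the baselines named in the abstract, the following is a minimal sketch of a plain roofline estimate applied layer by layer. It is not ANNETTE's stacked model: it ignores the mapping step (for example, how a compiler fuses layers) and uses no benchmark-fitted parameters. The `Layer` fields, the function names, and the peak throughput and bandwidth figures are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Layer:
    name: str
    ops: float          # arithmetic operations needed by the layer
    bytes_moved: float  # bytes read from and written to memory


def roofline_latency(layer, peak_ops_per_s, peak_bytes_per_s):
    """Roofline bound: the layer is limited by compute or by memory traffic."""
    compute_time = layer.ops / peak_ops_per_s
    memory_time = layer.bytes_moved / peak_bytes_per_s
    return max(compute_time, memory_time)


def network_latency(layers, peak_ops_per_s, peak_bytes_per_s):
    """Layer-wise estimate: sum the per-layer bounds, assuming no overlap."""
    return sum(
        roofline_latency(layer, peak_ops_per_s, peak_bytes_per_s)
        for layer in layers
    )


# Hypothetical two-layer network on a hypothetical accelerator.
layers = [
    Layer("conv1", ops=2e9, bytes_moved=1e6),  # compute-bound
    Layer("fc1", ops=1e6, bytes_moved=4e6),    # memory-bound
]
total_s = network_latency(layers, peak_ops_per_s=1e12, peak_bytes_per_s=1e10)
```

Because real accelerators rarely reach their peak numbers for every layer shape, an estimate like this tends to be optimistic, which is the gap that refined and benchmark-derived models aim to close.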
