Improving the Performance of DNN-based Software Services using Automated Layer Caching
Mohammadamin Abedi, Yanni Iouannou, Pooyan Jamshidi, Hadi Hemmati

TL;DR
This paper introduces an automated online layer caching mechanism for DNNs that enables early exits during inference, reducing computational complexity by up to 58% and inference latency by up to 46% with minimal loss of accuracy.
Contribution
It presents a novel online caching approach using self-distillation and early exits, suitable for pre-trained models and real-time applications.
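The early-exit idea can be illustrated with a minimal sketch. This is not the authors' implementation: in the paper the intermediate predictors are trained by self-distillation from the pre-trained model, whereas here `exit_heads`, `final_head`, `threshold`, and the toy layers are hypothetical stand-ins chosen only to show the control flow. Inference stops at the first intermediate head whose softmax confidence reaches the threshold, so the remaining layers are never computed for that input.

```python
import math
from typing import Callable, List, Optional, Sequence, Tuple

Vector = List[float]
Layer = Callable[[Vector], Vector]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def infer_with_early_exit(
    x: Vector,
    layers: Sequence[Layer],
    exit_heads: Sequence[Optional[Layer]],
    final_head: Layer,
    threshold: float,
) -> Tuple[int, float, int]:
    """Return (predicted_class, confidence, layers_executed).

    exit_heads[i] is an optional small classifier attached after layers[i]
    (hypothetical; None means no exit point at that depth).
    """
    hidden = x
    for depth, (layer, head) in enumerate(zip(layers, exit_heads), start=1):
        hidden = layer(hidden)
        if head is None:
            continue
        probs = softmax(head(hidden))
        confidence = max(probs)
        if confidence >= threshold:
            # Confident enough: skip all remaining layers.
            return probs.index(confidence), confidence, depth
    # No exit fired: fall back to the full model's classifier.
    probs = softmax(final_head(hidden))
    confidence = max(probs)
    return probs.index(confidence), confidence, len(layers)


# Toy setup (assumption): each layer doubles the activations and each
# head reads the hidden vector directly as logits.
toy_layers: List[Layer] = [lambda h: [2.0 * v for v in h] for _ in range(3)]
toy_heads: List[Optional[Layer]] = [lambda h: h, lambda h: h, None]
toy_final: Layer = lambda h: h

easy = infer_with_early_exit([2.0, 0.0], toy_layers, toy_heads, toy_final, 0.9)
hard = infer_with_early_exit([0.1, 0.0], toy_layers, toy_heads, toy_final, 0.9)
print("easy input used", easy[2], "layer(s); hard input used", hard[2])
```

In this toy setup the clearly separable input exits after the first layer, while the ambiguous one runs the full stack. The threshold trades accuracy for compute: raising it makes early exits rarer but safer.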
Findings
Reduced computational complexity by up to 58%.
Improved inference latency by up to 46%.
Maintained accuracy with minimal loss.
Abstract
Deep Neural Networks (DNNs) have become an essential component in many application domains including web-based services. A variety of these services require high throughput and (close to) real-time features, for instance, to respond or react to users' requests or to process a stream of incoming data on time. However, the trend in DNN design is toward larger models with many layers and parameters to achieve more accurate results. Although these models are often pre-trained, the computational complexity in such large models can still be relatively significant, hindering low inference latency. Implementing a caching mechanism is a typical systems engineering solution for speeding up a service response time. However, traditional caching is often not suitable for DNN-based services. In this paper, we propose an end-to-end automated solution to improve the performance of DNN-based services in…
Taxonomy
Topics: IoT and Edge/Fog Computing · Advanced Neural Network Applications · Data Stream Mining Techniques
Methods: Early exiting using confidence measures
