Chrion: Optimizing Recurrent Neural Network Inference by Collaboratively Utilizing CPUs and GPUs
Zinuo Cai, Hao Wang, Tao Song, Yang Hua, Ruhui Ma, Haibing Guan

TL;DR
Chrion optimizes recurrent neural network inference in cloud clusters by partitioning models and scheduling the parts across CPUs and GPUs, reducing execution latency by up to 19.4% and GPU memory footprint by 67.5%.
Contribution
The paper formulates model deployment in a CPU-GPU cluster as an NP-hard problem of scheduling directed acyclic graphs on heterogeneous devices, and proposes a method to partition models for efficient execution across those devices.
Findings
Up to 19.4% reduction in execution latency.
GPU memory footprint reduced by 67.5%.
Effective model partitioning improves inference performance.
Abstract
Deploying deep learning models in cloud clusters provides efficient and prompt inference services to accommodate the widespread application of deep learning. These clusters are usually equipped with host CPUs and accelerators with distinct responsibilities to handle serving requests, i.e., general-purpose CPUs for input preprocessing and domain-specific GPUs for forward computation. Recurrent neural networks play an essential role in handling temporal inputs and display distinctive computation characteristics because of their high inter-operator parallelism. Hence, we propose Chrion to optimize recurrent neural network inference by collaboratively utilizing CPUs and GPUs. We formulate the model deployment in the CPU-GPU cluster as an NP-hard scheduling problem of directed acyclic graphs on heterogeneous devices. Given an input model in the ONNX format and user-defined SLO requirement,…
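To make the scheduling formulation concrete, here is a minimal sketch of placing the operators of a small directed acyclic graph on two devices. It is not Chrion's algorithm: it swaps in a simple greedy earliest-finish-time heuristic, and the function name `greedy_schedule`, the per-operator costs, and the fixed `transfer` penalty for crossing between devices are all assumptions made for illustration.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def greedy_schedule(deps, cost, transfer=0.0):
    """Greedy earliest-finish-time placement (illustrative, not Chrion's method).

    deps: {op: set of predecessor ops}
    cost: {op: {"cpu": seconds, "gpu": seconds}}  (hypothetical timings)
    transfer: assumed fixed penalty when an input crosses devices
    """
    free_at = {"cpu": 0.0, "gpu": 0.0}  # time at which each device is idle
    placement, finish = {}, {}
    for op in TopologicalSorter(deps).static_order():
        best = None
        for dev in free_at:
            # An input is available once its producer finishes, plus the
            # transfer penalty if the producer ran on the other device.
            inputs_ready = max(
                (finish[p] + (transfer if placement[p] != dev else 0.0)
                 for p in deps.get(op, ())),
                default=0.0,
            )
            end = max(free_at[dev], inputs_ready) + cost[op][dev]
            if best is None or end < best[0]:
                best = (end, dev)
        finish[op], placement[op] = best
        free_at[best[1]] = best[0]
    return placement, max(finish.values())


# Two independent operators feeding a third, mimicking inter-operator parallelism.
deps = {"a": set(), "b": set(), "c": {"a", "b"}}
cost = {
    "a": {"cpu": 4.0, "gpu": 1.0},
    "b": {"cpu": 2.0, "gpu": 1.5},
    "c": {"cpu": 5.0, "gpu": 1.0},
}
placement, makespan = greedy_schedule(deps, cost, transfer=0.5)
print(placement, makespan)
```

A greedy heuristic like this gives no optimality guarantee, which is consistent with the problem being NP-hard. It only illustrates how independent operators can be spread across the CPU and GPU instead of queuing on a single device.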
Taxonomy
Topics: Advanced Neural Network Applications · Advanced Graph Neural Networks · Graph Theory and Algorithms
