DPU or GPU for Accelerating Neural Networks Inference -- Why not both? Split CNN Inference
Ali Emre Oztas, Mahir Demir, James Garside, and Mikel Luján

TL;DR
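This paper proposes split CNN inference, which partitions a network across a DPU and a GPU to reduce latency for video/image streaming on edge devices. A GNN-based predictor selects the partition point, and the paper reports up to 2.48x lower latency than DPU-only execution and up to 3.37x lower than GPU-only execution.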
Contribution
It introduces a novel partitioning approach for CNN inference across DPU and GPU, including an automated GNN-based prediction method for optimal layer splitting.
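To make the idea of choosing a layer split concrete, here is a minimal sketch that picks a cut point by exhaustive search over per-layer latency profiles. It is not the paper's GNN-based predictor. The function name `best_split`, the latency lists, and the cost model (the slower of the two pipelined stages bounds the steady-state time per frame) are all assumptions made for illustration.

```python
from typing import Sequence, Tuple


def best_split(
    dpu_ms: Sequence[float],
    gpu_ms: Sequence[float],
    transfer_ms: Sequence[float],
) -> Tuple[int, float]:
    """Return (k, cost): layers [0, k) run on the DPU, layers [k, n) on the GPU.

    dpu_ms[i] and gpu_ms[i] are hypothetical per-layer latencies.
    transfer_ms[k] is the hypothetical cost of moving the tensor that
    crosses the cut at position k, for k = 0..n.
    """
    n = len(dpu_ms)
    if len(gpu_ms) != n or len(transfer_ms) != n + 1:
        raise ValueError("expected n GPU latencies and n + 1 transfer costs")

    best_k, best_cost = 0, float("inf")
    for k in range(n + 1):
        dpu_stage = sum(dpu_ms[:k])
        gpu_stage = transfer_ms[k] + sum(gpu_ms[k:])
        # Assumed cost model: with the two stages pipelined, the slower
        # stage bounds the steady-state time per frame.
        cost = max(dpu_stage, gpu_stage)
        if cost < best_cost:
            best_k, best_cost = k, cost
    return best_k, best_cost


# Made-up numbers, for illustration only.
k, cost = best_split(
    dpu_ms=[1.0, 1.0, 4.0, 4.0],
    gpu_ms=[3.0, 3.0, 1.0, 1.0],
    transfer_ms=[5.0, 2.0, 0.5, 0.5, 0.0],
)
print(k, cost)
```

Exhaustive search needs a measured profile for every candidate cut, which is presumably the cost a learned predictor is meant to avoid.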
Findings
Up to 2.48x latency reduction over DPU-only execution.
Up to 3.37x latency reduction over GPU-only execution.
GNN-based partition prediction achieves 96.27% accuracy.
Abstract
Video and image streaming on edge devices requires low latency. To address this, Neural Networks (NNs) are widely used, and prior work mainly focuses on accelerating them with single hardware units such as Graphics Processing Units (GPUs), Field Programmable Gate Arrays (FPGAs), and Deep Learning Processing Units (DPUs). However, further reductions in latency can be observed by combining these units. In this paper, partitioning CNN inference across DPU and GPU (Split CNN Inference) is proposed. The first partition runs on the AI engines (DPU) of a Versal VCK190, which consists of initial CNN layers processing the input images. The DPU processes the first partition near the source of the data. Pipelined asynchronously, a GPU runs the remaining layers. The GPU (NVIDIA RTX 2080) processes the second partition, albeit having reduced the data transfer between the data source (storage/camera)…
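The asynchronous, pipelined hand-off between the two partitions can be sketched with two threads and a bounded queue. This is a sketch under stated assumptions, not the authors' implementation: `first_stage` and `second_stage` are plain Python callables standing in for the DPU and GPU partitions, and `run_split_pipeline` and `max_in_flight` are hypothetical names. No real DPU or GPU runtime calls are used.

```python
import queue
import threading
from typing import Any, Callable, Iterable, List

_STOP = object()  # sentinel that marks the end of the stream


def run_split_pipeline(
    frames: Iterable[Any],
    first_stage: Callable[[Any], Any],
    second_stage: Callable[[Any], Any],
    max_in_flight: int = 2,
) -> List[Any]:
    """Run first_stage and second_stage concurrently on a stream of frames."""
    # Bounded queue: the first stage blocks if it gets too far ahead.
    handoff: queue.Queue = queue.Queue(maxsize=max_in_flight)
    results: List[Any] = []

    def producer() -> None:
        # Stand-in for the first partition (initial layers, near the data source).
        try:
            for frame in frames:
                handoff.put(first_stage(frame))
        finally:
            handoff.put(_STOP)  # always release the consumer, even on error

    def consumer() -> None:
        # Stand-in for the second partition (remaining layers).
        while True:
            item = handoff.get()
            if item is _STOP:
                break
            results.append(second_stage(item))

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


# Toy stages: the arithmetic only shows that ordering is preserved.
outputs = run_split_pipeline(range(5), lambda x: x + 1, lambda y: y * 2)
print(outputs)
```

While the second stage works on frame i, the first stage can already process frame i + 1, so the two partitions overlap in time instead of running back to back.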
