Optimizing Prediction Serving on Low-Latency Serverless Dataflow

Vikram Sreekanti; Harikaran Subbaraj; Chenggang Wu; Joseph E. Gonzalez; Joseph M. Hellerstein

arXiv:2007.05832 · cs.DC · July 14, 2020 · 6 cites

TL;DR

This paper introduces Cloudflow, a serverless dataflow system optimized for low-latency prediction serving, achieving up to 2x performance improvements over existing systems through operator fusion and execution optimizations.

Contribution

The paper presents Cloudflow, a serverless dataflow system that exposes a familiar dataflow API for prediction serving and applies performance optimizations without modifying the ML models, which it treats as black boxes.

Findings

01 Cloudflow achieves up to 2x performance gains over state-of-the-art systems.

02 Operator fusion and optimized execution are key to performance improvements.

03 Cloudflow successfully meets latency goals in real-time prediction tasks.

Abstract

Prediction serving systems are designed to provide large volumes of low-latency inferences from machine learning models. These systems mix data processing and computationally intensive model inference and benefit from multiple heterogeneous processors and distributed computing resources. In this paper, we argue that a familiar dataflow API is well-suited to this latency-sensitive task, and amenable to optimization even with unmodified black-box ML models. We present the design of Cloudflow, a system that provides this API and realizes it on an autoscaling serverless backend. Cloudflow transparently implements performance-critical optimizations including operator fusion and competitive execution. Our evaluation shows that Cloudflow's optimizations yield significant performance improvements on synthetic workloads and that Cloudflow outperforms state-of-the-art prediction serving systems by as…
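
The two optimizations named in the abstract can be illustrated with a minimal Python sketch. This is not Cloudflow's API or implementation: the helper names `fuse`, `competitive`, and `serve`, the three toy stages, and the use of a local thread pool in place of a serverless backend are all assumptions made for illustration. The sketch also assumes that operator fusion means running a chain of operators in a single invocation, and that competitive execution means running redundant copies of an operator and taking whichever result arrives first.

```python
import random
import time
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait
from functools import reduce


def fuse(*ops):
    """Operator fusion (illustrative): collapse a chain of operators into one
    callable so the whole chain runs in a single invocation."""
    def fused(x):
        return reduce(lambda acc, op: op(acc), ops, x)
    return fused


def competitive(op, replicas, pool):
    """Competitive execution (illustrative): submit `replicas` copies of `op`
    and return the result of the first copy to finish."""
    def run(x):
        futures = [pool.submit(op, x) for _ in range(replicas)]
        done, pending = wait(futures, return_when=FIRST_COMPLETED)
        for f in pending:
            f.cancel()  # best effort; copies already running are not interrupted
        return next(iter(done)).result()
    return run


# Hypothetical stages standing in for preprocessing, a black-box model,
# and postprocessing. None of these come from the paper.
def preprocess(x):
    return x * 2


def model(x):
    time.sleep(random.uniform(0.0, 0.05))  # simulated variable inference latency
    return x + 1


def postprocess(y):
    return f"label-{y}"


def serve(x):
    """Run the fused pipeline, racing three copies of the model stage."""
    with ThreadPoolExecutor(max_workers=3) as pool:
        pipeline = fuse(preprocess, competitive(model, 3, pool), postprocess)
        return pipeline(x)
```

Calling `serve(3)` runs all three stages as one fused callable and races three copies of the slow stage. In this sketch the model is treated purely as an opaque function, which mirrors the abstract's point that such optimizations do not require modifying the model itself.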

Taxonomy

Topics: Advanced Neural Network Applications · Parallel Computing and Optimization Techniques · Cloud Computing and Resource Management