Scaling TensorFlow to 300 million predictions per second

Jan Hartman; Davorin Kopič

arXiv:2109.09541·cs.LG·September 21, 2021

TL;DR

This paper discusses scaling TensorFlow to handle 300 million predictions per second in an online advertising setting, focusing on challenges and optimization techniques for low-latency model serving.

Contribution

The work details the process of transitioning large-scale machine learning models to TensorFlow and optimizing how they are served for high-throughput, low-latency prediction.

Findings

01. Achieved 300 million predictions per second using TensorFlow
02. Implemented effective optimization techniques for low-latency serving
03. Addressed key challenges in large-scale TensorFlow deployment

Abstract

We present the process of transitioning machine learning models to the TensorFlow framework at a large scale in an online advertising ecosystem. In this talk we address the key challenges we faced and describe how we successfully tackled them, notably implementing the models in TF and serving them efficiently with low latency using various optimization techniques.