Scaling TensorFlow to 300 million predictions per second
Jan Hartman, Davorin Kopič

TL;DR
This paper discusses scaling TensorFlow to handle 300 million predictions per second in an online advertising setting, focusing on challenges and optimization techniques for low-latency model serving.
Contribution
The paper details the process of transitioning large-scale machine learning models to TensorFlow and optimizing their deployment for high-throughput, low-latency predictions.
Findings
Scaled TensorFlow model serving to 300 million predictions per second in an online advertising ecosystem.
Applied optimization techniques to serve the models efficiently at low latency.
Addressed the key challenges of implementing the models in TensorFlow and deploying them at large scale.
Abstract
We present the process of transitioning machine learning models to the TensorFlow framework at a large scale in an online advertising ecosystem. In this talk we address the key challenges we faced and describe how we successfully tackled them; notably, implementing the models in TF and serving them efficiently with low latency using various optimization techniques.
