Distributed Hierarchical GPU Parameter Server for Massive Scale Deep Learning Ads Systems
Weijie Zhao, Deping Xie, Ronglai Jia, Yulei Qian, Ruiquan Ding, Mingming Sun, Ping Li

TL;DR
This paper presents a hierarchical GPU-based parameter server system designed for massive-scale deep learning in online advertising, enabling faster training of billion-parameter models with improved cost efficiency.
Contribution
It introduces a novel three-layer hierarchical storage architecture utilizing GPU memory, CPU memory, and SSD for scalable deep learning training of extremely large models.
Findings
A 4-node hierarchical GPU parameter server trains a model more than 2x faster than a 150-node in-memory distributed parameter server in an MPI cluster
The system achieves a 4-9x better price-performance ratio than an MPI-cluster solution
Effective handling of models with over 10^11 sparse features
Abstract
Neural networks of ads systems usually take input from multiple resources, e.g., query-ad relevance, ad features and user portraits. These inputs are encoded into one-hot or multi-hot binary features, with typically only a tiny fraction of nonzero feature values per example. Deep learning models in the online advertising industry can have terabyte-scale parameters that fit in neither the GPU memory nor the CPU main memory of a computing node. For example, a sponsored online advertising system can contain more than 10^11 sparse features, making the neural network a massive model with around 10 TB of parameters. In this paper, we introduce a distributed GPU hierarchical parameter server for massive scale deep learning ads systems. We propose a hierarchical workflow that utilizes GPU High-Bandwidth Memory, CPU main memory and SSD as 3-layer hierarchical storage. All the neural network…
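The 3-layer storage idea in the abstract can be illustrated with a toy lookup that checks the fastest tier first and falls back to slower ones. This is a minimal sketch under stated assumptions, not the paper's implementation: the class `TieredParameterStore`, the method `pull`, the LRU eviction policy, and the zero initialization of unseen keys are all hypothetical choices made for illustration, and plain Python dictionaries stand in for GPU HBM, CPU main memory, and SSD.

```python
from collections import OrderedDict


class TieredParameterStore:
    """Toy three-tier parameter lookup (hypothetical; all tiers are simulated)."""

    def __init__(self, hbm_capacity, mem_capacity, dim=4):
        self.hbm = OrderedDict()  # fastest, smallest tier (stands in for GPU HBM)
        self.mem = OrderedDict()  # middle tier (stands in for CPU main memory)
        self.ssd = {}             # slowest, unbounded tier (stands in for SSD)
        self.hbm_capacity = hbm_capacity
        self.mem_capacity = mem_capacity
        self.dim = dim            # length of each sparse parameter vector

    def _put(self, tier, capacity, demote, key, value):
        # Insert as most recently used; demote least recently used on overflow.
        tier[key] = value
        tier.move_to_end(key)
        while len(tier) > capacity:
            old_key, old_value = tier.popitem(last=False)
            demote(old_key, old_value)

    def _put_mem(self, key, value):
        self._put(self.mem, self.mem_capacity, self.ssd.__setitem__, key, value)

    def _put_hbm(self, key, value):
        self._put(self.hbm, self.hbm_capacity, self._put_mem, key, value)

    def pull(self, key):
        """Return (parameters, tier the key was found in) and promote it to HBM."""
        if key in self.hbm:
            self.hbm.move_to_end(key)
            return self.hbm[key], "hbm"
        if key in self.mem:
            value, source = self.mem.pop(key), "mem"
        elif key in self.ssd:
            value, source = self.ssd.pop(key), "ssd"
        else:
            # Assumption: unseen sparse features start as zero vectors.
            value, source = [0.0] * self.dim, "new"
        self._put_hbm(key, value)
        return value, source
```

With both capacities set to 2, pulling keys 1 through 5 in order pushes key 1 down to the simulated SSD tier, and a later `pull(1)` reports `"ssd"` before promoting it back to the top tier. A real system would move batches of parameters between tiers rather than single keys; the sketch only shows the fall-through order.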
Taxonomy
Topics: Advanced Image and Video Retrieval Techniques · Recommender Systems and Techniques · Stochastic Gradient Optimization Techniques
Methods: Convolution · Non Maximum Suppression · 1x1 Convolution · SSD
