Communication-Efficient TeraByte-Scale Model Training Framework for Online Advertising
Weijie Zhao, Xuewu Jiao, Mingqing Hu, Xiaoyun Li, Xiangyu Zhang, Ping Li

TL;DR
This paper presents a hardware-aware, communication-efficient training framework for large-scale CTR models, significantly reducing training time without sacrificing accuracy by optimizing data communication and introducing a novel $k$-step Adam optimizer.
Contribution
It introduces a hardware-aware training workflow and a $k$-step model merging algorithm for Adam, addressing communication bottlenecks in large-scale CTR model training.
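The $k$-step merging idea can be illustrated with a minimal NumPy sketch. It assumes that each worker runs standard Adam locally and that, every $k$ steps, the workers average their parameters together with both Adam moment estimates. Which quantities the paper actually merges is not stated in this summary, so that choice is an assumption, and the names `adam_step`, `k_step_adam`, and `grad_fns` are hypothetical. Workers are simulated sequentially in a single process.

```python
import numpy as np


def adam_step(param, grad, m, v, t, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update with bias correction; returns (param, m, v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v


def k_step_adam(grad_fns, init, k, num_steps, lr=0.01):
    """Simulate one worker per entry of grad_fns, merging state every k steps.

    Assumption (not taken from the paper): the merge averages the parameters
    and both Adam moments across workers.
    """
    n = len(grad_fns)
    params = [init.copy() for _ in range(n)]
    ms = [np.zeros_like(init) for _ in range(n)]
    vs = [np.zeros_like(init) for _ in range(n)]
    merges = 0
    for t in range(1, num_steps + 1):
        # Local phase: every worker updates its own replica, no communication.
        for i, grad_fn in enumerate(grad_fns):
            grad = grad_fn(params[i])
            params[i], ms[i], vs[i] = adam_step(params[i], grad, ms[i], vs[i], t, lr=lr)
        # Merge phase: one synchronization round every k local steps.
        if t % k == 0:
            avg_p = np.mean(params, axis=0)
            avg_m = np.mean(ms, axis=0)
            avg_v = np.mean(vs, axis=0)
            params = [avg_p.copy() for _ in range(n)]
            ms = [avg_m.copy() for _ in range(n)]
            vs = [avg_v.copy() for _ in range(n)]
            merges += 1
    return np.mean(params, axis=0), merges
```

In this sketch a larger `k` means fewer synchronization rounds (`num_steps // k` in total), at the cost of more drift between worker replicas between merges.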
Findings
Reduces training time significantly on real-world data
Maintains model accuracy despite communication optimizations
First application of $k$-step Adam in industrial CTR training
Abstract
Click-Through Rate (CTR) prediction is a crucial component of the online advertising industry. To produce personalized CTR predictions, an industry-level CTR prediction model commonly takes as input a high-dimensional sparse vector (e.g., 100 billion or 1,000 billion features) encoded from query keywords, user portraits, etc. As a result, the model requires terabyte-scale parameters to embed the high-dimensional input. The hierarchical distributed GPU parameter server has been proposed to enable GPUs with limited memory to train such massive networks by leveraging CPU main memory and SSDs as secondary storage. We identify two major challenges in the existing GPU training framework for massive-scale ad models and propose a collection of optimizations to tackle them: (a) the GPU, CPU, and SSD rapidly communicate with each other during training. The connections between…
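The storage hierarchy mentioned in the abstract (GPU memory backed by CPU main memory and SSDs) can be pictured with a toy three-tier lookup. This is only a sketch under strong assumptions: the LRU eviction policy, the class name `TieredEmbeddingStore`, and the use of in-process dictionaries to stand in for each tier are all hypothetical and are not taken from the paper.

```python
from collections import OrderedDict


class TieredEmbeddingStore:
    """Toy store: small 'GPU' tier, larger 'CPU' tier, unbounded 'SSD' tier.

    Assumption: both bounded tiers use LRU eviction; plain dicts model each tier.
    """

    def __init__(self, gpu_capacity, cpu_capacity, ssd):
        self.gpu = OrderedDict()
        self.cpu = OrderedDict()
        self.ssd = dict(ssd)
        self.gpu_capacity = gpu_capacity
        self.cpu_capacity = cpu_capacity
        self.hits = {"gpu": 0, "cpu": 0, "ssd": 0}

    @staticmethod
    def _insert(tier, capacity, key, value):
        """Insert as most recently used; return the evicted (key, value) or None."""
        tier[key] = value
        tier.move_to_end(key)
        if len(tier) > capacity:
            return tier.popitem(last=False)
        return None

    def lookup(self, key):
        """Fetch an embedding, promoting it to the GPU tier on a miss."""
        if key in self.gpu:
            self.hits["gpu"] += 1
            self.gpu.move_to_end(key)
            return self.gpu[key]
        if key in self.cpu:
            self.hits["cpu"] += 1
            value = self.cpu.pop(key)
        else:
            self.hits["ssd"] += 1
            value = self.ssd[key]
        # Promote to GPU; spill evicted entries down the hierarchy.
        evicted = self._insert(self.gpu, self.gpu_capacity, key, value)
        if evicted is not None:
            spilled = self._insert(self.cpu, self.cpu_capacity, *evicted)
            if spilled is not None:
                self.ssd[spilled[0]] = spilled[1]
        return value
```

Each miss in a faster tier turns into traffic on a slower link, which is why the abstract singles out communication among GPU, CPU, and SSD as a challenge.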
Taxonomy
Topics: Image and Video Quality Assessment · Recommender Systems and Techniques · Stochastic Gradient Optimization Techniques
Methods: Non Maximum Suppression · Convolution · 1x1 Convolution · SSD · Adam
