Communication-Efficient TeraByte-Scale Model Training Framework for Online Advertising

Weijie Zhao; Xuewu Jiao; Mingqing Hu; Xiaoyun Li; Xiangyu Zhang; Ping Li

arXiv:2201.05500·cs.IR·January 17, 2022

TL;DR

This paper presents a hardware-aware, communication-efficient training framework for large-scale CTR models, significantly reducing training time without sacrificing accuracy by optimizing data communication and introducing a novel $k$-step Adam optimizer.

Contribution

It introduces a hardware-aware training workflow and a $k$-step model merging algorithm for Adam, addressing communication bottlenecks in large-scale CTR model training.
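
The idea of $k$-step model merging can be illustrated with a minimal sketch: each worker runs Adam locally for $k$ steps without communicating, after which the workers average their state. This is not the authors' implementation. The function names (`adam_step`, `k_step_adam`), the hyperparameters, the simulated in-process workers, and in particular the choice to average the Adam moment estimates together with the parameters are assumptions made for illustration.

```python
import numpy as np


def adam_step(w, m, v, grad, t, lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    """One standard Adam update with bias correction (t starts at 1)."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad ** 2
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v


def k_step_adam(grad_fns, w0, k, total_steps, lr=0.05):
    """Simulate local Adam on len(grad_fns) workers, merging every k steps.

    grad_fns[i](w) returns worker i's gradient at w (hypothetical interface).
    Returns the averaged parameters and the number of merge rounds.
    """
    n = len(grad_fns)
    ws = [w0.copy() for _ in range(n)]
    ms = [np.zeros_like(w0) for _ in range(n)]
    vs = [np.zeros_like(w0) for _ in range(n)]
    merges = 0
    for t in range(1, total_steps + 1):
        # Local phase: every worker updates its own copy, no communication.
        for i, grad_fn in enumerate(grad_fns):
            ws[i], ms[i], vs[i] = adam_step(
                ws[i], ms[i], vs[i], grad_fn(ws[i]), t, lr=lr
            )
        # Merge phase: one averaging round every k steps.
        # Assumption: moments are averaged along with the parameters.
        if t % k == 0:
            w_avg = np.mean(ws, axis=0)
            m_avg = np.mean(ms, axis=0)
            v_avg = np.mean(vs, axis=0)
            ws = [w_avg.copy() for _ in range(n)]
            ms = [m_avg.copy() for _ in range(n)]
            vs = [v_avg.copy() for _ in range(n)]
            merges += 1
    return np.mean(ws, axis=0), merges


if __name__ == "__main__":
    # Toy problem: worker i minimizes 0.5 * ||w - c_i||^2, so grad = w - c_i.
    targets = [np.array([2.0, 0.0]), np.array([0.0, 2.0])]
    grad_fns = [lambda w, c=c: w - c for c in targets]
    w, merges = k_step_adam(grad_fns, np.array([5.0, -3.0]), k=10, total_steps=2000)
    print("merged parameters:", w, "merge rounds:", merges)
```

In this sketch, communication happens once per $k$ local steps rather than once per step, so the number of synchronization rounds drops by a factor of $k$. Setting $k = 1$ recovers fully synchronous averaging after every update.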

Findings

1. Reduces training time significantly on real-world data
2. Maintains model accuracy despite communication optimizations
3. First application of $k$-step Adam in industrial CTR training

Abstract

Click-Through Rate (CTR) prediction is a crucial component in the online advertising industry. To produce a personalized CTR prediction, an industry-level CTR prediction model commonly takes a high-dimensional sparse vector (e.g., 100 or 1,000 billion features), encoded from query keywords, user portraits, etc., as input. As a result, the model requires terabyte-scale parameters to embed the high-dimensional input. A hierarchical distributed GPU parameter server has been proposed to enable GPUs with limited memory to train the massive network by leveraging CPU main memory and SSDs as secondary storage. We identify two major challenges in the existing GPU training framework for massive-scale ad models and propose a collection of optimizations to tackle these challenges: (a) the GPU, CPU, and SSD rapidly communicate with each other during training. The connections between…
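
A rough back-of-the-envelope calculation shows why an input of this dimensionality leads to terabyte-scale parameters. The embedding width of 8 values per feature and the 4-byte (fp32) storage are assumptions chosen for illustration; only the $10^{11}$ feature count comes from the abstract.

$$
10^{11}\ \text{features} \times 8\ \tfrac{\text{values}}{\text{feature}} \times 4\ \tfrac{\text{bytes}}{\text{value}} = 3.2 \times 10^{12}\ \text{bytes} \approx 3.2\ \text{TB}
$$

Under the same assumptions, the $10^{12}$-feature case would need roughly 32 TB, which is far beyond the memory of a single GPU and is consistent with the need for CPU main memory and SSDs as secondary storage.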

Taxonomy

Topics: Image and Video Quality Assessment · Recommender Systems and Techniques · Stochastic Gradient Optimization Techniques

Methods: Non Maximum Suppression · Convolution · 1x1 Convolution · SSD · Adam