Large-Scale Training System for 100-Million Classification at Alibaba
Liuyihan Song, Pan Pan, Kang Zhao, Hao Yang, Yiming Chen, Yingya Zhang, Yinghui Xu, and Rong Jin

TL;DR
This paper introduces a large-scale training system for 100-million-class classification. It combines a hybrid parallel framework, a novel KNN softmax, and communication and convergence optimizations, with a reported 3.9× increase in training throughput and 60% fewer training iterations.
Contribution
The paper presents a novel large-scale training system with a new softmax variation and optimization techniques, enabling efficient training of extremely large classifiers.
Findings
3.9× training throughput increase
60% reduction in training iterations
Successful training of 100 million classes in five days
Abstract
In the last decades, extreme classification has become an essential topic for deep learning. It has achieved great success in many areas, especially in computer vision and natural language processing (NLP). However, it is very challenging to train a deep model with millions of classes due to the memory and computation explosion in the last output layer. In this paper, we propose a large-scale training system to address these challenges. First, we build a hybrid parallel training framework to make the training process feasible. Second, we propose a novel softmax variation named KNN softmax, which reduces both the GPU memory consumption and computation costs and improves the throughput of training. Then, to eliminate the communication overhead, we propose a new overlapping pipeline and a gradient sparsification method. Furthermore, we design a fast continuous convergence strategy to…
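To make the "memory explosion in the last output layer" concrete, here is a rough back-of-the-envelope sketch. The feature dimension (512), float32 storage, and GPU count (256) are illustrative assumptions, not figures from the paper. The sketch counts classifier weights only and ignores gradients, optimizer state, and logits. The even split across GPUs is a generic model-parallel assumption, not a description of the authors' hybrid framework.

```python
def classifier_weight_bytes(num_classes, feature_dim, bytes_per_param=4):
    """Bytes needed to store a dense (num_classes x feature_dim) weight matrix."""
    return num_classes * feature_dim * bytes_per_param


def per_gpu_bytes(total_bytes, num_gpus):
    """Bytes per GPU if the weight matrix is split evenly (ceiling division)."""
    return -(-total_bytes // num_gpus)


NUM_CLASSES = 100_000_000  # class count from the paper's title
FEATURE_DIM = 512          # assumption: illustrative embedding size
NUM_GPUS = 256             # assumption: illustrative cluster size

total = classifier_weight_bytes(NUM_CLASSES, FEATURE_DIM)
shard = per_gpu_bytes(total, NUM_GPUS)

print(f"full classifier weights: {total / 1e9:.1f} GB")
print(f"per-GPU shard over {NUM_GPUS} GPUs: {shard / 1e9:.2f} GB")
```

Under these assumptions the full weight matrix is about 204.8 GB, far beyond a single GPU's memory, while an even 256-way split is about 0.8 GB per device. This is one way to see why some form of parallelism over the output layer is needed before any softmax approximation is applied.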
Taxonomy
Methods: Gradient Sparsification · Softmax
