HET: Scaling out Huge Embedding Model Training via Cache-enabled Distributed Framework

Xupeng Miao; Hailin Zhang; Yining Shi; Xiaonan Nie; Zhi Yang; Yangyu Tao; Bin Cui

arXiv:2112.07221·cs.LG·December 15, 2021

TL;DR

HET is a scalable distributed framework for training large embedding models. It uses an embedding cache and a new consistency model to significantly reduce communication overhead and improve training speed.

Contribution

HET introduces a cache-enabled distributed system with a novel consistency model to improve the scalability of huge embedding model training.

Findings

01

Achieves up to 88% reduction in embedding communication.

02

Realizes up to 20.68x speedup over baselines.

03

Effectively handles skewed embedding popularity distributions.

Abstract

Embedding models have been an effective learning paradigm for high-dimensional data. However, one open issue of embedding models is that their representations (latent factors) often result in large parameter space. We observe that existing distributed training frameworks face a scalability issue of embedding models since updating and retrieving the shared embedding parameters from servers usually dominates the training cycle. In this paper, we propose HET, a new system framework that significantly improves the scalability of huge embedding model training. We embrace skewed popularity distributions of embeddings as a performance opportunity and leverage it to address the communication bottleneck with an embedding cache. To ensure consistency across the caches, we incorporate a new consistency model into HET design, which provides fine-grained consistency guarantees on a per-embedding…
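The link between skewed embedding popularity and reduced communication can be illustrated with a small simulation. The following is a minimal sketch, not HET's implementation: it assumes a Zipf-like access pattern, a plain LRU eviction policy on the worker, and read-only lookups, and it ignores gradient updates and the consistency model entirely. The names `zipf_ids` and `count_server_fetches` and all parameter values are hypothetical.

```python
import random
from collections import OrderedDict


def zipf_ids(num_ids, num_samples, skew, seed=0):
    """Draw embedding ids with Zipf-like popularity (assumed access pattern)."""
    rng = random.Random(seed)
    # The id at rank r is sampled with weight 1 / r**skew, so low ids are "hot".
    weights = [1.0 / (rank ** skew) for rank in range(1, num_ids + 1)]
    return rng.choices(range(num_ids), weights=weights, k=num_samples)


def count_server_fetches(ids, cache_size):
    """Count lookups that miss a worker-side LRU cache and must go to the server."""
    cache = OrderedDict()
    fetches = 0
    for emb_id in ids:
        if emb_id in cache:
            # Cache hit: served locally, mark as most recently used.
            cache.move_to_end(emb_id)
        else:
            # Cache miss: one round trip to the parameter server.
            fetches += 1
            cache[emb_id] = True
            if len(cache) > cache_size:
                # Evict the least recently used entry.
                cache.popitem(last=False)
    return fetches


if __name__ == "__main__":
    ids = zipf_ids(num_ids=10_000, num_samples=50_000, skew=1.2)
    no_cache = count_server_fetches(ids, cache_size=0)
    with_cache = count_server_fetches(ids, cache_size=1_000)
    print(f"fetches without cache: {no_cache}")
    print(f"fetches with cache:    {with_cache}")
    print(f"reduction:             {1 - with_cache / no_cache:.1%}")
```

The measured reduction depends on the assumed skew and cache size and is not comparable to the figures reported for HET. The sketch only shows the general effect: when a few embeddings receive most of the accesses, a cache holding a fraction of the table absorbs a large share of the lookups.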
