Faster Distributed Inference-Only Recommender Systems via Bounded Lag Synchronous Collectives
Kiril Dichev, Filip Pawlowski, Albert-Jan Yzelman

TL;DR
This paper introduces a bounded lag synchronous alltoallv communication method for distributed recommender systems, improving inference latency and throughput in unbalanced or irregular access scenarios by allowing controlled process lagging.
Contribution
It proposes a novel BLS alltoallv operation that reduces synchronization overhead in distributed DLRMs, especially effective in unbalanced or irregular access conditions.
Findings
Improves latency and throughput in unbalanced DLRM runs
Masks process delays in inference-only scenarios
No notable advantage in well-balanced runs
Abstract
Recommender systems are enablers of personalized content delivery, and therefore revenue, for many large companies. Over the last decade, deep learning recommender models (DLRMs) have become the de facto standard in this field. The main bottleneck in DLRM inference is the lookup of sparse features across huge embedding tables, which are usually partitioned across the aggregate RAM of many nodes. In state-of-the-art recommender systems, the distributed lookup is implemented via irregular all-to-all (alltoallv) communication, and this communication is often the main bottleneck. Today, most related work sees this operation as a given; in addition, every collective is synchronous in nature. In this work, we propose a novel bounded lag synchronous (BLS) version of the alltoallv operation. The bound can be set as a parameter, allowing slower processes to lag behind by entire iterations before the fastest processes block. In…
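The bounded-lag idea in the abstract can be illustrated with a toy timing model. The sketch below is not the paper's implementation and involves no real communication: it assumes each process spends a fixed time per iteration, and that a process may start iteration i only once every process has finished iteration i - 1 - bound. The function name `simulate_makespan`, the sample timings, and this gating rule are all assumptions made for illustration. With a bound of 0 the model reduces to a fully synchronous collective, where every iteration waits for the slowest process.

```python
from typing import List


def simulate_makespan(compute_times: List[List[float]], bound: int) -> float:
    """Toy model: total time for all processes to finish all iterations.

    compute_times[p][i] is the time process p spends on iteration i.
    bound is the number of iterations a slow process may lag behind
    before faster processes have to block (0 = fully synchronous).
    """
    num_procs = len(compute_times)
    num_iters = len(compute_times[0])
    finish = [[0.0] * num_iters for _ in range(num_procs)]

    for i in range(num_iters):
        # Everyone must have completed iteration `gate` before anyone
        # is allowed to start iteration i (assumed gating rule).
        gate = i - 1 - bound
        gate_time = max(finish[q][gate] for q in range(num_procs)) if gate >= 0 else 0.0
        for p in range(num_procs):
            own_ready = finish[p][i - 1] if i > 0 else 0.0
            start = max(own_ready, gate_time)
            finish[p][i] = start + compute_times[p][i]

    return max(finish[p][-1] for p in range(num_procs))


# Hypothetical timings: the two processes take turns being the slow one.
unbalanced = [[3, 1, 3, 1],
              [1, 3, 1, 3]]
# Hypothetical timings: both processes always take the same time.
balanced = [[2, 2, 2, 2],
            [2, 2, 2, 2]]

print(simulate_makespan(unbalanced, bound=0))  # 12.0
print(simulate_makespan(unbalanced, bound=1))  # 8.0
print(simulate_makespan(balanced, bound=0))    # 8.0
print(simulate_makespan(balanced, bound=1))    # 8.0
```

In this toy setting, a lag bound of 1 lets delays on alternating processes overlap instead of adding up, while the balanced run gains nothing. That is consistent with the findings listed above, but the numbers come from the simplified model only and say nothing about the paper's measured results.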
Taxonomy
Topics: Recommender Systems and Techniques · Caching and Content Delivery · Machine Learning in Healthcare
