Understanding Data Storage and Ingestion for Large-Scale Deep Recommendation Model Training
Mark Zhao, Niket Agarwal, Aarti Basant, Bugra Gedik, Satadru Pan, Mustafa Ozdal, Rakesh Komuravelli, Jerry Pan, Tianshu Bao, Haowei Lu, Sundaram Narayanan, Jack Langman, Kevin Wilfong, Harsha Rastogi, Carole-Jean Wu, Christos Kozyrakis, Parik Pol

TL;DR
This paper analyzes Meta's large-scale data storage and ingestion (DSI) pipeline for deep learning training, highlighting bottlenecks, resource usage, and opportunities for hardware and infrastructure improvements.
Contribution
It provides a detailed characterization of Meta's DSI pipeline at scale, including infrastructure design, resource demands, and bottleneck analysis, with insights for optimization.
Findings
Identified hardware bottlenecks in DSI systems
Demonstrated the resource intensity of data preprocessing during training
Highlighted the need for heterogeneous DSI hardware and improved scheduling
Abstract
Datacenter-scale AI training clusters consisting of thousands of domain-specific accelerators (DSAs) are used to train increasingly complex deep learning models. These clusters rely on a data storage and ingestion (DSI) pipeline, responsible for storing exabytes of training data and serving it at tens of terabytes per second. As DSAs continue to push training efficiency and throughput, the DSI pipeline is becoming the dominating factor that constrains the overall training performance and capacity. Innovations that improve the efficiency and performance of DSI systems and hardware are urgent, demanding a deep understanding of DSI characteristics and infrastructure at scale. This paper presents Meta's end-to-end DSI pipeline, composed of a central data warehouse built on distributed storage and a Data PreProcessing Service that scales to eliminate data stalls. We characterize how…
