Tensor Casting: Co-Designing Algorithm-Architecture for Personalized Recommendation Training
Youngeun Kwon, Yunjae Lee, Minsoo Rhu

TL;DR
This paper investigates the training of personalized recommendation models, identifies sparse embedding layer training as a bottleneck, and proposes Tensor Casting, a co-designed accelerator architecture that significantly improves training throughput.
Contribution
It introduces Tensor Casting, a novel algorithm-architecture co-design for tensor gather-scatter, optimizing recommendation training on CPU-GPU systems.
Findings
Tensor Casting achieves up to 21x training throughput improvement.
Workload characterization highlights sparse embedding training as a key bottleneck.
Prototyping demonstrates effectiveness on real CPU-GPU systems.
Abstract
Personalized recommendations are one of the most widely deployed machine learning (ML) workloads served from cloud datacenters. As such, architectural solutions for high-performance recommendation inference have recently been the target of several prior works. Unfortunately, little has been explored or understood regarding the training side of this emerging ML workload. In this paper, we first perform a detailed workload characterization study on training recommendations, root-causing sparse embedding layer training as one of the most significant performance bottlenecks. We then propose our algorithm-architecture co-design called Tensor Casting, which enables the development of a generic accelerator architecture for tensor gather-scatter that encompasses all the key primitives of training embedding layers. When prototyped on a real CPU-GPU system, Tensor Casting provides…
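For readers unfamiliar with the gather-scatter primitives the abstract refers to, the following is a minimal NumPy sketch of a generic sum-pooled embedding layer: the forward pass gathers rows from the table, and the backward pass scatters gradients back to the rows that were read. It is not the Tensor Casting algorithm itself; the function names (embedding_forward, embedding_backward), the sum pooling, and the toy table are assumptions chosen purely for illustration.

```python
import numpy as np

def embedding_forward(table, indices):
    # Gather: read one table row per lookup index, then sum-pool the rows.
    return table[indices].sum(axis=0)

def embedding_backward(table_shape, indices, grad_out):
    # Scatter: each gathered row receives the gradient of the pooled output.
    grad_table = np.zeros(table_shape, dtype=grad_out.dtype)
    # np.add.at accumulates correctly when an index appears more than once.
    np.add.at(grad_table, indices, grad_out)
    return grad_table

# Toy example (hypothetical data): a 4-row table with 3-dimensional embeddings.
table = np.arange(12, dtype=np.float64).reshape(4, 3)
indices = np.array([0, 2, 2])  # row 2 is looked up twice

pooled = embedding_forward(table, indices)
grad_table = embedding_backward(table.shape, indices, np.ones(3))
```

Only the rows named in indices receive a nonzero gradient, and the duplicated index accumulates two contributions. Both passes touch a small, data-dependent subset of a large table, which is the sparse, irregular access pattern that the abstract identifies as a training bottleneck.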
