Scaling Recommender Transformers to One Billion Parameters
Kirill Khrylchenko, Artem Matveev, Sergei Makeev, Vladimir Baikalov

TL;DR
This paper demonstrates how to train and deploy large-scale transformer recommender models with up to one billion parameters, significantly improving recommendation quality on a real-world music platform.
Contribution
It introduces a scalable training recipe for billion-parameter transformer recommenders and shows effective decomposition of autoregressive learning tasks.
Findings
Achieved successful deployment on a large-scale music platform
Online A/B tests show a 2.26% increase in total listening time
Likelihood of a user like increased by 6.37%
Abstract
While large transformer models have been successfully used in many real-world applications such as natural language processing, computer vision, and speech processing, scaling transformers for recommender systems remains a challenging problem. Recently, the Generative Recommenders framework was proposed to scale beyond typical Deep Learning Recommendation Models (DLRMs). Reformulating recommendation as a sequential transduction task improved scaling properties in terms of compute. Nevertheless, the largest encoder configuration reported by the HSTU authors amounts to only ~176 million parameters, which is considerably smaller than the hundreds of billions or even trillions of parameters common in modern language models. In this work, we present a recipe for training large transformer recommenders with up to a billion parameters. We show that autoregressive learning on user…
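To give a feel for the gap between ~176 million and one billion parameters, the sketch below applies the common rule of thumb that a standard transformer block holds roughly 12 * d_model^2 weights (attention projections plus a 4x feed-forward layer). The function name and the two layer/width shapes are hypothetical, chosen only to land near those two scales. They are not the configurations used by the HSTU authors or in this paper, and HSTU blocks differ from the standard block assumed here.

```python
def approx_encoder_params(n_layers: int, d_model: int, ffn_mult: int = 4) -> int:
    """Rough weight count for a stack of standard transformer blocks.

    Counts only the four attention projections (4 * d^2) and the two
    feed-forward matrices (2 * ffn_mult * d^2) per layer. Biases, layer
    norms, and embedding tables are ignored.
    """
    attention = 4 * d_model * d_model
    feed_forward = 2 * ffn_mult * d_model * d_model
    return n_layers * (attention + feed_forward)


# Hypothetical shapes (assumptions, not taken from either paper).
small = approx_encoder_params(n_layers=14, d_model=1024)
large = approx_encoder_params(n_layers=20, d_model=2048)

print(f"small: {small / 1e6:.0f}M parameters")
print(f"large: {large / 1e9:.2f}B parameters")
```

Because the estimate grows with the square of the width, doubling d_model quadruples the encoder size at a fixed depth. In a recommender, item embedding tables can add substantially to this count and are deliberately left out of the sketch.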
Taxonomy
Topics: Advanced Text Analysis Techniques
