Principled Synthetic Data Enables the First Scaling Laws for LLMs in Recommendation
Benyu Zhang, Qiang Zhang, Jianpeng Cheng, Hong-You Chen, Qifei Wang, Wei Sun, Shen Li, Jia Li, Jiahao Wu, Xiangjun Fan, Hong Yan

TL;DR
This paper introduces a principled synthetic data generation framework that enables the first reliable scaling laws for large language models in recommendation systems, significantly improving their performance and predictability.
Contribution
It presents a novel layered synthetic data framework that produces high-quality, recommendation-specific data, facilitating robust LLM scaling laws and improved downstream performance.
Findings
Models trained on synthetic data outperform models trained on real data in ranking tasks.
First empirical demonstration of power-law scaling for LLMs in recommendation.
Consistent perplexity reduction across multiple synthetic data modalities.
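As a rough illustration of what a power-law scaling claim means in practice, the sketch below fits a curve of the form loss ≈ a · N^(-b) by linear regression in log-log space. The function name `fit_power_law`, the choice of N as a generic scale variable (model or data size), and the sample points are all assumptions made for illustration. The numbers are generated from a known curve and are not results from the paper, and the paper's own fitting procedure may differ.

```python
import numpy as np


def fit_power_law(sizes, losses):
    """Fit loss ~= a * size**(-b) via least squares on log-transformed values.

    Hypothetical helper for illustration; not the paper's method.
    """
    log_n = np.log(np.asarray(sizes, dtype=float))
    log_l = np.log(np.asarray(losses, dtype=float))
    # A power law is a straight line in log-log space:
    # log(loss) = log(a) - b * log(size)
    slope, intercept = np.polyfit(log_n, log_l, deg=1)
    return float(np.exp(intercept)), float(-slope)


# Made-up points generated from a known curve (a=50, b=0.1), NOT paper results.
sizes = np.array([1e7, 1e8, 1e9, 1e10])
losses = 50.0 * sizes ** -0.1

a, b = fit_power_law(sizes, losses)
print(f"a = {a:.3f}, b = {b:.3f}")
```

With real measurements the points would scatter around the line, and the quality of the fit (for example, the residuals in log space) is what indicates whether scaling is predictable.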
Abstract
Large Language Models (LLMs) represent a promising frontier for recommender systems, yet their development has been impeded by the absence of predictable scaling laws, which are crucial for guiding research and optimizing resource allocation. We hypothesize that this may be attributed to the inherent noise, bias, and incompleteness of raw user interaction data in prior continual pre-training (CPT) efforts. This paper introduces a novel, layered framework for generating high-quality synthetic data that circumvents such issues by creating a curated, pedagogical curriculum for the LLM. We provide powerful, direct evidence for the utility of our curriculum by showing that standard sequential models trained on our principled synthetic data significantly outperform ( on recall@100 for SasRec) models trained on real data in downstream ranking tasks, demonstrating its superiority for…
Taxonomy
Topics: Recommender Systems and Techniques · Explainable Artificial Intelligence (XAI) · Sentiment Analysis and Opinion Mining
