Generating Diverse Synthetic Datasets for Evaluation of Real-life Recommender Systems
Miha Malenšek, Blaž Škrlj, Blaž Mramor, Jure Demšar

TL;DR
This paper introduces a flexible, modular framework for generating diverse, high-quality synthetic datasets tailored for evaluating real-life recommender systems, addressing limitations of existing methods.
Contribution
The authors present a novel, open-source Python framework that enables controlled, customizable synthetic dataset generation for recommender system research.
Findings
Framework effectively isolates model behavior in diverse scenarios
Enables benchmarking and bias detection in recommender systems
Supports iterative modifications for specific experimental needs
Abstract
Synthetic datasets are important for evaluating and testing machine learning models. When evaluating real-life recommender systems, high-dimensional categorical (and sparse) datasets are often considered. Unfortunately, there are not many solutions that would allow generation of artificial datasets with such characteristics. For that purpose, we developed a novel framework for generating synthetic datasets that are diverse and statistically coherent. Our framework allows for creation of datasets with controlled attributes, enabling iterative modifications to fit specific experimental needs, such as introducing complex feature interactions, feature cardinality, or specific distributions. We demonstrate the framework's utility through use cases such as benchmarking probabilistic counting algorithms, detecting algorithmic bias, and simulating AutoML searches. Unlike existing methods that…
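To make the idea of "controlled attributes" concrete, here is a minimal sketch in plain Python. It is not the authors' framework or its API. The function names, the Zipf-like skew parameter, the column names, and the parity-based interaction rule are all assumptions chosen for illustration. The sketch draws high-cardinality categorical columns from a skewed distribution and assigns a binary label that depends on the combination of two features.

```python
import random
from typing import Dict, List, Tuple


def zipf_weights(cardinality: int, skew: float) -> List[float]:
    """Unnormalised Zipf-like weights: category k gets weight 1 / (k + 1) ** skew."""
    return [1.0 / (rank + 1) ** skew for rank in range(cardinality)]


def generate_dataset(
    n_rows: int,
    cardinalities: Dict[str, int],
    skew: float = 1.1,
    interaction: Tuple[str, str] = ("user_id", "item_id"),
    seed: int = 0,
) -> List[Dict[str, int]]:
    """Hypothetical generator: skewed categorical columns plus an interaction-driven label."""
    rng = random.Random(seed)  # a fixed seed keeps the dataset reproducible

    # Draw each column independently; `cardinalities` caps the number of
    # distinct values a column can take.
    columns = {
        name: rng.choices(range(card), weights=zipf_weights(card, skew), k=n_rows)
        for name, card in cardinalities.items()
    }

    first, second = interaction
    rows = []
    for i in range(n_rows):
        row = {name: values[i] for name, values in columns.items()}
        # Assumed interaction rule: the label probability is set by the
        # combination of the two features (parity of their sum).
        positive_prob = 0.9 if (row[first] + row[second]) % 2 == 0 else 0.1
        row["label"] = int(rng.random() < positive_prob)
        rows.append(row)
    return rows


if __name__ == "__main__":
    data = generate_dataset(1000, {"user_id": 50, "item_id": 20, "device": 5}, seed=42)
    print(data[0])
    print("distinct user_id values:", len({r["user_id"] for r in data}))
```

Changing `cardinalities`, `skew`, or `interaction` and regenerating with the same seed gives a simple way to vary one property of the data at a time. That is the kind of iterative modification the abstract describes.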
Taxonomy
Topics: Recommender Systems and Techniques
Methods: Focus
