TL;DR
This paper investigates how different dataset sampling strategies affect recommendation algorithm performance and introduces methods to select sampling schemes that preserve model effectiveness, enabling more efficient data usage.
Contribution
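It characterizes how sampling affects recommendation performance in terms of algorithm and dataset characteristics, proposes SVP-CF, a data-specific sampling strategy that aims to preserve the relative performance of models after sampling, and develops Data-Genie to suggest a sampling scheme that preserves model performance for a given dataset.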
Findings
Data sampling significantly influences algorithm performance.
Data-Genie can discard up to 5x more data than a fixed sampling strategy at the same level of performance.
SVP-CF effectively preserves model rankings after sampling.
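One way to make "preserves model rankings" concrete is to compare the ordering of models on the full dataset with their ordering on a sample. The sketch below uses Kendall's tau as a generic rank-agreement measure; this section does not state which measure the paper uses, so treat the choice as an assumption. The model names and scores are hypothetical illustration values, not results from the paper.

```python
from itertools import combinations

def kendall_tau(scores_full, scores_sampled):
    """Kendall's tau-a between two {model: score} dicts (no tie correction)."""
    models = sorted(scores_full)
    concordant = discordant = 0
    for a, b in combinations(models, 2):
        # A pair is concordant when both settings order the two models the same way.
        product = (scores_full[a] - scores_full[b]) * (scores_sampled[a] - scores_sampled[b])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs

# Hypothetical metric values for three hypothetical models.
full = {"model_a": 0.30, "model_b": 0.25, "model_c": 0.20}
sample_same_order = {"model_a": 0.22, "model_b": 0.18, "model_c": 0.15}
sample_one_swap = {"model_a": 0.18, "model_b": 0.22, "model_c": 0.15}

tau_same = kendall_tau(full, sample_same_order)  # all pairs agree
tau_swap = kendall_tau(full, sample_one_swap)    # one of three pairs flipped
```

A value of 1 means the sample orders the models exactly as the full data does, even if absolute scores drop; lower values mean some pairwise comparisons would lead to a different conclusion on the sample.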
Abstract
We study the practical consequences of dataset sampling strategies on the ranking performance of recommendation algorithms. Recommender systems are generally trained and evaluated on samples of larger datasets. Samples are often taken in a naive or ad-hoc fashion, e.g., by sampling a dataset randomly or by selecting users or items with many interactions. As we demonstrate, commonly used data sampling schemes can have significant consequences on algorithm performance. Following this observation, this paper makes three main contributions: (1) characterizing the effect of sampling on algorithm performance, in terms of algorithm and dataset characteristics (e.g., sparsity characteristics, sequential dynamics, etc.); (2) designing SVP-CF, a data-specific sampling strategy that aims to preserve the relative performance of models after sampling, and is especially suited to long-tailed…
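The two naive schemes the abstract mentions, random sampling and keeping only heavy users, can be sketched as follows. This is a minimal illustration, assuming interactions are stored as `(user, item)` pairs; the function names, the toy data, and the rule of keeping a fixed fraction of users are all hypothetical choices, not the paper's implementation.

```python
import random
from collections import Counter

def random_interaction_sample(interactions, fraction, seed=0):
    """Keep a uniformly random fraction of the interactions."""
    rng = random.Random(seed)
    k = round(len(interactions) * fraction)
    return rng.sample(interactions, k)

def head_user_sample(interactions, fraction):
    """Keep every interaction of the most active fraction of users."""
    counts = Counter(user for user, _ in interactions)
    n_keep = max(1, round(len(counts) * fraction))
    kept_users = {user for user, _ in counts.most_common(n_keep)}
    return [(user, item) for user, item in interactions if user in kept_users]

# Toy long-tailed data: one heavy user and three light users.
toy = [
    ("u1", "i1"), ("u1", "i2"), ("u1", "i3"), ("u1", "i4"),
    ("u2", "i1"), ("u2", "i5"),
    ("u3", "i2"),
    ("u4", "i6"),
]

random_half = random_interaction_sample(toy, fraction=0.5)
head_quarter = head_user_sample(toy, fraction=0.25)
```

On this toy data both samples contain four interactions, but the head-user sample consists entirely of one user's history, while the random sample spreads across users and thins out each history. Differences of this kind in sparsity and user coverage are what make the choice of scheme matter.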
