On Sampling Collaborative Filtering Datasets

Noveen Sachdeva; Carole-Jean Wu; Julian McAuley

arXiv:2201.04768 · cs.LG · January 14, 2022

TL;DR

This paper investigates how different dataset sampling strategies affect recommendation algorithm performance and introduces methods to select sampling schemes that preserve model effectiveness, enabling more efficient data usage.

Contribution

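It characterizes how sampling affects recommendation performance in terms of algorithm and dataset characteristics, proposes SVP-CF, a data-specific sampling strategy that aims to preserve the relative performance of models after sampling, and develops Data-Genie to recommend the sampling scheme best suited to a given dataset.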

Findings

01

Data sampling significantly influences algorithm performance.

02

With Data-Genie, up to 5x more data can be discarded than with other sampling strategies at the same level of performance.

03

SVP-CF effectively preserves model rankings after sampling.
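
One common way to quantify whether a sample "preserves model rankings" is a rank correlation between model scores on the full dataset and on the sample. The sketch below computes Kendall's tau (the tau-a variant, with tied pairs counted as neither concordant nor discordant) in plain Python. The function name `kendall_tau`, the model names, and all scores are made up for illustration; they are not results from the paper, and the paper's exact metric may differ.

```python
from itertools import combinations


def kendall_tau(scores_full, scores_sampled):
    """Kendall's tau-a between two {model_name: score} dicts.

    Returns +1.0 when both dicts order the models identically and
    -1.0 when the order is fully reversed.
    """
    models = sorted(scores_full)
    if len(models) < 2:
        raise ValueError("need at least two models to compare rankings")

    concordant = discordant = 0
    for a, b in combinations(models, 2):
        diff_full = scores_full[a] - scores_full[b]
        diff_sampled = scores_sampled[a] - scores_sampled[b]
        product = diff_full * diff_sampled
        if product > 0:
            concordant += 1   # same relative order on both datasets
        elif product < 0:
            discordant += 1   # order flipped after sampling

    n_pairs = len(models) * (len(models) - 1) // 2
    return (concordant - discordant) / n_pairs


# Toy, invented scores (e.g. a ranking metric per model); not from the paper.
full_scores = {"model_a": 0.30, "model_b": 0.25, "model_c": 0.20, "model_d": 0.10}
sampled_scores = {"model_a": 0.28, "model_b": 0.18, "model_c": 0.22, "model_d": 0.05}

tau = kendall_tau(full_scores, sampled_scores)
```

In this toy case only the pair (`model_b`, `model_c`) swaps order after sampling, so five of six pairs are concordant and tau is 4/6.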

Abstract

We study the practical consequences of dataset sampling strategies on the ranking performance of recommendation algorithms. Recommender systems are generally trained and evaluated on samples of larger datasets. Samples are often taken in a naive or ad-hoc fashion: e.g. by sampling a dataset randomly or by selecting users or items with many interactions. As we demonstrate, commonly-used data sampling schemes can have significant consequences on algorithm performance. Following this observation, this paper makes three main contributions: (1) characterizing the effect of sampling on algorithm performance, in terms of algorithm and dataset characteristics (e.g. sparsity characteristics, sequential dynamics, etc.); (2) designing SVP-CF, which is a data-specific sampling strategy, that aims to preserve the relative performance of models after sampling, and is especially suited to long-tailed…
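
The two naive schemes named in the abstract (random sampling, and keeping users with many interactions) can be sketched as follows. This is a minimal illustration on a toy list of `(user, item)` pairs; the function names, the `fraction` parameter, and the data are assumptions made here and are not taken from the paper or its repository.

```python
import random
from collections import Counter


def sample_interactions_randomly(interactions, fraction, seed=0):
    """Keep a uniformly random `fraction` of the (user, item) interactions."""
    rng = random.Random(seed)
    n_keep = int(len(interactions) * fraction)
    return rng.sample(interactions, n_keep)


def sample_head_users(interactions, fraction):
    """Keep every interaction of the most active `fraction` of users."""
    counts = Counter(user for user, _ in interactions)
    n_keep = max(1, int(len(counts) * fraction))
    # Rank users by descending interaction count; break ties by user id
    # so the result is deterministic.
    ranked = sorted(counts, key=lambda u: (-counts[u], u))
    kept_users = set(ranked[:n_keep])
    return [(u, i) for u, i in interactions if u in kept_users]


# Toy interaction log (invented for illustration).
interactions = [
    ("u1", "a"), ("u1", "b"), ("u1", "c"), ("u1", "d"),
    ("u2", "a"), ("u2", "b"),
    ("u3", "c"),
    ("u4", "d"),
]

random_half = sample_interactions_randomly(interactions, 0.5)
head_half = sample_head_users(interactions, 0.5)
```

On this toy log, the head-user scheme keeps only `u1` and `u2` and drops the long-tail users `u3` and `u4` entirely, which illustrates how such a sample can shift the sparsity characteristics of the data.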

Code & Models

Repositories

noveens/sampling_cf (PyTorch, official)
