On the Reliability of Sampling Strategies in Offline Recommender Evaluation
Bruno L. Pereira, Alan Said, Rodrygo L. T. Santos

TL;DR
This paper examines how different sampling strategies impact the reliability of offline recommender system evaluation, providing insights and guidance to improve evaluation fidelity under exposure biases.
Contribution
It systematically analyzes the effects of logging and sampling choices on offline evaluation reliability using a fully observed dataset as ground truth.
Findings
Sampling strategies vary in their ability to distinguish between models.
Certain sampling methods maintain higher fidelity and robustness.
Guidelines are provided for selecting effective sampling strategies.
Abstract
Offline evaluation plays a central role in benchmarking recommender systems when online testing is impractical or risky. However, it is susceptible to two key sources of bias: exposure bias, where users only interact with items they are shown, and sampling bias, introduced when evaluation is performed on a subset of logged items rather than the full catalog. While prior work has proposed methods to mitigate sampling bias, these are typically assessed on fixed logged datasets rather than for their ability to support reliable model comparisons under varying exposure conditions or relative to true user preferences. In this paper, we investigate how different combinations of logging and sampling choices affect the reliability of offline evaluation. Using a fully observed dataset as ground truth, we systematically simulate diverse exposure biases and assess the reliability of common sampling…
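To make the notion of sampling bias concrete, the following is a minimal sketch, not the authors' protocol. It ranks one held-out item against the full catalog and against a uniformly sampled candidate set. The toy scores, the catalog size, the choice of 99 uniform negatives, and all function names are assumptions introduced purely for illustration.

```python
import random


def rank_of_target(scores, target, candidates):
    """Return the 1-based rank of `target` among `candidates` by descending score."""
    target_score = scores[target]
    better = sum(
        1 for item in candidates
        if item != target and scores[item] > target_score
    )
    return 1 + better


def hit_at_k(scores, target, candidates, k):
    """Return 1.0 if `target` is ranked within the top k of `candidates`, else 0.0."""
    return 1.0 if rank_of_target(scores, target, candidates) <= k else 0.0


def sample_uniform_negatives(catalog, target, n, rng):
    """Draw n distinct non-target items uniformly at random (hypothetical sampler)."""
    negatives = [item for item in catalog if item != target]
    return rng.sample(negatives, n)


# Toy setup (assumed): 1000 items, a model that scores item 0 highest.
catalog = list(range(1000))
scores = {item: float(1000 - item) for item in catalog}
target = 50  # held-out relevant item; 50 items score higher than it

# Evaluation against the full catalog.
full_rank = rank_of_target(scores, target, catalog)
full_hit = hit_at_k(scores, target, catalog, k=10)

# Evaluation against the target plus 99 uniformly sampled negatives.
rng = random.Random(0)
sampled = sample_uniform_negatives(catalog, target, 99, rng) + [target]
sampled_rank = rank_of_target(scores, target, sampled)
sampled_hit = hit_at_k(scores, target, sampled, k=10)

print("full-catalog rank:", full_rank, "hit@10:", full_hit)
print("sampled rank:", sampled_rank, "hit@10:", sampled_hit)
```

Because the sampled candidate set is a subset of the catalog, the sampled rank can never be worse than the full-catalog rank, so sampled metrics tend to look more optimistic. Whether a given sampler still orders competing models the same way as the full-catalog evaluation is the kind of reliability question the abstract raises.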
