TL;DR
This paper reveals that offline evaluation of recommendation systems is affected by Simpson's paradox due to confounding factors from deployed systems, and proposes a new evaluation method that improves correlation with true rankings.
Contribution
The paper identifies Simpson's paradox in offline recommendation evaluation and introduces a novel methodology that accounts for confounders, enhancing evaluation accuracy.
Findings
Stratified sampling exposes confounding effects of frequently exposed items.
Proposed evaluation method improves correlation with true rankings by 14-40%.
The method performs statistically significantly better on open loop datasets.
Abstract
Recommendation systems are often evaluated based on users' interactions that were collected from an existing, already deployed recommendation system. In this situation, users only provide feedback on the exposed items, and they may not leave feedback on other items since they have not been exposed to them by the deployed system. As a result, the collected feedback dataset that is used to evaluate a new model is influenced by the deployed system, as a form of closed loop feedback. In this paper, we show that the typical offline evaluation of recommender systems suffers from the so-called Simpson's paradox. Simpson's paradox is the name given to a phenomenon observed when a significant trend appears in several different sub-populations of observational data but disappears or is even reversed when these sub-populations are combined. Our in-depth experiments based on stratified…
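The reversal described in the abstract can be illustrated with a small toy calculation. The sketch below is not the paper's method or data: the two models "A" and "B", the stratum names, and every count are hypothetical, chosen only to show how a trend that holds in each sub-population can flip once the sub-populations are pooled.

```python
# Toy illustration of Simpson's paradox.
# All names and counts are hypothetical; none come from the paper.
# Each entry is (hits, trials) observed for a model within one stratum.
counts = {
    "A": {"low_exposure": (81, 87), "high_exposure": (192, 263)},
    "B": {"low_exposure": (234, 270), "high_exposure": (55, 80)},
}


def stratum_rates(model):
    """Hit rate of a model within each stratum."""
    return {s: hits / trials for s, (hits, trials) in counts[model].items()}


def pooled_rate(model):
    """Hit rate of a model after combining all strata."""
    hits = sum(h for h, _ in counts[model].values())
    trials = sum(t for _, t in counts[model].values())
    return hits / trials


# Model A has the higher hit rate inside every stratum...
a_wins_each_stratum = all(
    stratum_rates("A")[s] > stratum_rates("B")[s] for s in counts["A"]
)
# ...yet model B has the higher hit rate once the strata are pooled,
# because the two models' trials are spread unevenly across strata.
b_wins_pooled = pooled_rate("B") > pooled_rate("A")

print(stratum_rates("A"), stratum_rates("B"))
print(pooled_rate("A"), pooled_rate("B"))
print(a_wins_each_stratum, b_wins_pooled)
```

In this toy setup the uneven stratum sizes play the role of the confounder: comparing only the pooled rates would rank the models in the opposite order from the per-stratum comparison.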
