Re-Evaluating the Netflix Prize - Human Uncertainty and its Impact on Reliability
Kevin Jasberg, Sergej Sizov

TL;DR
This paper investigates how human rating variability affects the reliability of recommender system evaluations, revealing that many top rankings are statistically uncertain and may be influenced by chance.
Contribution
It introduces a probabilistic approach to rating assessment, accounting for human uncertainty, and re-evaluates the reliability of the Netflix Prize rankings.
Findings
User ratings are inconsistent upon repeated questioning.
Accuracy metrics can be modeled as probability densities.
The top Netflix Prize rankings have a high probability of being due to chance.
Abstract
In this paper, we examine the statistical soundness of comparative assessments within the field of recommender systems in terms of reliability and human uncertainty. From a controlled experiment, we find that users give different ratings to the same items when asked repeatedly. This volatility of user ratings justifies modeling each rating as a probability density instead of a single score. As a consequence, the well-known accuracy metrics (e.g. MAE, MSE, RMSE) themselves yield a density that emerges from the convolution of all rating densities. When two different systems produce RMSE distributions that overlap substantially, there is a probability of error for each possible ranking. As an application, we examine possible ranking errors of the Netflix Prize. We are able to show that all top rankings are more or less subject to high probabilities of…
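The idea in the abstract can be illustrated with a small Monte Carlo sketch. It is not the authors' procedure: instead of convolving rating densities analytically, it samples them, and it assumes each rating is Gaussian around a mean with a common standard deviation, ignores the bounds of the rating scale, and scores both systems against the same simulated re-ratings. The function names (`simulate_reratings`, `rmse_per_draw`, `ranking_error_probability`) and all numbers are hypothetical.

```python
import numpy as np

def simulate_reratings(mean_ratings, rating_std, n_draws=2000, seed=0):
    """Draw hypothetical repeated ratings; one row per simulated re-rating.

    Assumption: Gaussian rating noise with a shared standard deviation.
    """
    rng = np.random.default_rng(seed)
    mean_ratings = np.asarray(mean_ratings, dtype=float)
    return rng.normal(mean_ratings, rating_std,
                      size=(n_draws, mean_ratings.size))

def rmse_per_draw(draws, predictions):
    """RMSE of fixed predictions against each simulated re-rating."""
    predictions = np.asarray(predictions, dtype=float)
    return np.sqrt(np.mean((draws - predictions) ** 2, axis=1))

def ranking_error_probability(rmse_first, rmse_second):
    """Share of draws in which the system ranked second has the lower RMSE."""
    return float(np.mean(rmse_second < rmse_first))

# Toy data (made up): 200 items, two systems with similar accuracy.
rng = np.random.default_rng(42)
means = rng.uniform(1.0, 5.0, size=200)
preds_a = means + rng.normal(0.0, 0.30, size=200)  # nominally better system
preds_b = means + rng.normal(0.0, 0.32, size=200)  # nominally worse system

draws = simulate_reratings(means, rating_std=0.5)
rmse_a = rmse_per_draw(draws, preds_a)
rmse_b = rmse_per_draw(draws, preds_b)

# Empirical spread of each system's RMSE and the chance that ranking A
# ahead of B is wrong under the simulated rating uncertainty.
p_error = ranking_error_probability(rmse_a, rmse_b)
print(f"RMSE A: {rmse_a.mean():.3f} +/- {rmse_a.std():.3f}")
print(f"RMSE B: {rmse_b.mean():.3f} +/- {rmse_b.std():.3f}")
print(f"P(ranking A ahead of B is wrong): {p_error:.3f}")
```

Each system's RMSE becomes a sample from a distribution rather than a single number, and the overlap between the two distributions translates directly into a probability that the observed ranking is reversed.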
Taxonomy
Topics: Scientific Measurement and Uncertainty Evaluation · Advanced Statistical Process Monitoring · Forecasting Techniques and Applications
