Quality Metrics in Recommender Systems: Do We Calculate Metrics Consistently?
Yan-Martin Tamm, Rinchin Damdinov, Alexey Vasilev

TL;DR
This paper investigates the consistency of quality metrics in recommender system evaluations, revealing widespread ambiguities and inconsistencies in their definitions and implementations across literature and tools.
Contribution
It provides a comprehensive analysis of how metrics are defined and used, highlighting the need for clearer, standardized evaluation protocols in recommender systems research.
Findings
Precision is the only universally understood metric.
Different libraries implement the same metric name differently.
Nearly half of the academic papers examined lack clear metric definitions.
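To make the second finding concrete, here is a minimal Python sketch of how one metric name can cover two different calculations. The function names, the toy data, and the two Recall@k normalisations are illustrative assumptions. They are not taken from the paper or from any specific library; they only show the kind of divergence the findings describe.

```python
def count_hits(recommended, relevant, k):
    """Number of relevant items among the first k recommendations."""
    return sum(1 for item in recommended[:k] if item in relevant)


def precision_at_k(recommended, relevant, k):
    """Hits divided by k, the usual textbook reading of Precision@k."""
    return count_hits(recommended, relevant, k) / k


def recall_at_k_full(recommended, relevant, k):
    """Hypothetical variant 1: hits divided by the number of relevant items."""
    if not relevant:
        return 0.0
    return count_hits(recommended, relevant, k) / len(relevant)


def recall_at_k_capped(recommended, relevant, k):
    """Hypothetical variant 2: hits divided by min(k, number of relevant items)."""
    if not relevant:
        return 0.0
    return count_hits(recommended, relevant, k) / min(k, len(relevant))


# Toy input (assumed for illustration): 5 recommendations, 6 relevant items.
recommended = ["a", "b", "c", "d", "e"]
relevant = {"a", "c", "x", "y", "z", "w"}
k = 5

print("Precision@5:        ", precision_at_k(recommended, relevant, k))
print("Recall@5 (full):    ", recall_at_k_full(recommended, relevant, k))
print("Recall@5 (capped):  ", recall_at_k_capped(recommended, relevant, k))
```

Both recall functions could reasonably be reported as "Recall@5", yet on this input they return different values, because the user has more relevant items than there are recommendation slots. A paper that names the metric without giving the formula leaves the reader unable to tell which one was used.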
Abstract
Offline evaluation is a popular approach to determine the best algorithm in terms of the chosen quality metric. However, if the chosen metric calculates something unexpected, this miscommunication can lead to poor decisions and wrong conclusions. In this paper, we thoroughly investigate quality metrics used for recommender systems evaluation. We look at the practical aspect of implementations found in modern RecSys libraries and at the theoretical aspect of definitions in academic papers. We find that Precision is the only metric universally understood among papers and libraries, while other metrics may have different interpretations. Metrics implemented in different libraries sometimes have the same name but measure different things, which leads to different results given the same input. When defining metrics in an academic paper, authors sometimes omit explicit formulations or give…
