TL;DR
This paper evaluates the reliability of joint fairness-relevance measures in recommender systems. It finds that they correlate weakly with one another, are less sensitive to changes in rank position than traditional measures, and have limited expressiveness, and it urges cautious use.
Contribution
It provides the first empirical analysis of joint fairness-relevance measures across 4 real-world datasets and 4 recommenders, highlighting their limitations and offering guidelines for proper usage.
Findings
Most joint measures correlate weakly with one another and sometimes contradict each other.
They are less sensitive to rank position changes than traditional measures.
They tend to compress scores at the low end, limiting expressiveness.
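The first finding concerns agreement between measures. As a rough illustration of how such agreement can be checked, the sketch below computes a rank correlation between the scores two measures assign to the same set of recommenders. It assumes Kendall's tau-a as the correlation coefficient, which this summary does not confirm as the paper's choice, and the scores and the names `measure_a`, `measure_b`, and `measure_c` are made up for illustration.

```python
from itertools import combinations


def kendall_tau(x, y):
    """Kendall's tau-a between two equally long lists of scores.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical scores that three measures give to the same four recommenders.
measure_a = [0.10, 0.20, 0.30, 0.40]
measure_b = [0.40, 0.30, 0.20, 0.10]  # orders the recommenders in reverse
measure_c = [0.10, 0.30, 0.20, 0.40]  # swaps the two middle recommenders

print(kendall_tau(measure_a, measure_b))  # full disagreement
print(kendall_tau(measure_a, measure_c))  # partial agreement
```

A value near 1 means the two measures order the recommenders the same way, a value near 0 means they are largely unrelated, and a negative value means they contradict each other.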
Abstract
Relevance and fairness are two major objectives of recommender systems (RSs). Recent work proposes measures of RS fairness that are either independent from relevance (fairness-only) or conditioned on relevance (joint measures). While fairness-only measures have been studied extensively, we look into whether joint measures can be trusted. We collect all joint evaluation measures of RS relevance and fairness, and ask: How much do they agree with each other? To what extent do they agree with relevance/fairness measures? How sensitive are they to changes in rank position, or to increasingly fair and relevant recommendations? We empirically study for the first time the behaviour of these measures across 4 real-world datasets and 4 recommenders. We find that most of these measures: i) correlate weakly with one another and even contradict each other at times; ii) are less sensitive to rank…
Taxonomy
Methods: Neighborhood Contrastive Learning
