Offline Retrieval Evaluation Without Evaluation Metrics
Fernando Diaz, Andres Ferraro

TL;DR
This paper introduces recall-paired preference (RPP), a metric-free evaluation method for offline retrieval that directly compares ranked lists, improving discrimination and robustness over traditional scalar metrics.
Contribution
The paper proposes RPP, a new evaluation approach that avoids scalar metrics, directly compares ranked lists, and better captures differences across user subpopulations.
Findings
RPP correlates well with existing metrics.
RPP improves discriminative power in evaluations.
RPP is robust to incomplete data.
Abstract
Offline evaluation of information retrieval and recommendation has traditionally focused on distilling the quality of a ranking into a scalar metric such as average precision or normalized discounted cumulative gain. We can use this metric to compare the performance of multiple systems for the same request. Although evaluation metrics provide a convenient summary of system performance, they also collapse subtle differences across users into a single number and can carry assumptions about user behavior and utility not supported across retrieval scenarios. We propose recall-paired preference (RPP), a metric-free evaluation method based on directly computing a preference between ranked lists. RPP simulates multiple user subpopulations per query and compares systems across these pseudo-populations. Our results across multiple search and recommendation tasks demonstrate that RPP…
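The idea of comparing two ranked lists directly, without first reducing each to a scalar, can be illustrated with a small Python sketch. It is not the authors' implementation. It assumes that each simulated user subpopulation corresponds to a recall level (users who stop after finding their i-th relevant item), that all recall levels are weighted equally, that relevant items missing from a ranking are placed just past its end, and that rankings contain no duplicate items. The function names `recall_positions` and `rpp` are hypothetical.

```python
from typing import Hashable, List, Sequence, Set


def recall_positions(ranking: Sequence[Hashable], relevant: Set[Hashable]) -> List[int]:
    """1-based rank at which the 1st, 2nd, ... relevant item is reached.

    Assumption: relevant items that the ranking never retrieves are placed
    just past the end of the list (position len(ranking) + 1).
    """
    positions = [rank for rank, item in enumerate(ranking, start=1) if item in relevant]
    missing = len(relevant) - len(positions)
    positions.extend([len(ranking) + 1] * missing)
    return positions


def rpp(ranking_a: Sequence[Hashable], ranking_b: Sequence[Hashable],
        relevant: Set[Hashable]) -> float:
    """Preference between two rankings for one query, in [-1, 1].

    Positive values prefer ranking_a, negative values prefer ranking_b.
    Assumption: every recall level counts equally.
    """
    if not relevant:
        return 0.0
    pos_a = recall_positions(ranking_a, relevant)
    pos_b = recall_positions(ranking_b, relevant)
    score = 0
    for a, b in zip(pos_a, pos_b):
        if a < b:        # ranking_a reaches this recall level sooner
            score += 1
        elif a > b:      # ranking_b reaches this recall level sooner
            score -= 1
    return score / len(relevant)


if __name__ == "__main__":
    relevant_docs = {"d1", "d4"}
    system_a = ["d1", "d2", "d3", "d4"]
    system_b = ["d2", "d3", "d1", "d4"]
    print(rpp(system_a, system_b, relevant_docs))
```

In this toy query, system A reaches the first relevant document sooner and the two systems tie at the second recall level, so the preference leans toward A. Because only the sign of each comparison is kept, the size of the rank gap does not affect the result, which is the sense in which no scalar metric value is computed for either list.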
Taxonomy
Topics: Information Retrieval and Search Behavior · Recommender Systems and Techniques · Multi-Criteria Decision Making
