TL;DR
This paper presents a method to validate synthetic user interaction data generated by click models in data-sparse living lab environments, enabling reliable evaluation of retrieval systems with limited user data.
Contribution
It introduces an evaluation approach for validating click model-generated data against known system rankings in human-in-the-loop settings with sparse data.
Findings
Simple click models can reliably evaluate system performance with 20 sessions.
Complex click models need more data but perform better in simulated experiments.
Distinguishing between diverse systems is easier than reproducing identical rankings.
Abstract
Evaluating retrieval performance without editorial relevance judgments is challenging, but user interactions can serve as relevance signals instead. Living labs offer a way for small-scale platforms to validate information retrieval systems with real users. If enough user interaction data are available, click models can be parameterized from historical sessions to evaluate systems before exposing users to experimental rankings. However, interaction data are sparse in living labs, and little is known about how click models can be validated for reliable user simulations when click data are available only in moderate amounts. This work introduces an evaluation approach for validating synthetic usage data generated by click models in data-sparse human-in-the-loop environments like living labs. We ground our methodology on the click model's estimates about a system ranking compared to a…
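The general idea of parameterizing a click model from logged sessions and checking whether its synthetic clicks reproduce a known system ordering can be sketched as follows. This is a minimal illustration, not the paper's implementation: the document-based CTR model stands in for "a simple click model", Kendall's tau is an assumed agreement measure, and the toy log, the systems `sys_a`/`sys_b`/`sys_c`, and all function names are hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations


def fit_dctr(sessions):
    """Estimate P(click | query, doc) as clicks / impressions (document-based CTR)."""
    shown = defaultdict(int)
    clicked = defaultdict(int)
    for query, ranking, clicks in sessions:
        for doc in ranking:
            shown[(query, doc)] += 1
            if doc in clicks:
                clicked[(query, doc)] += 1
    return {key: clicked[key] / shown[key] for key in shown}


def simulate_clicks(model, query, ranking, rng, default=0.0):
    """Sample one synthetic session; each document is clicked independently."""
    return [doc for doc in ranking if rng.random() < model.get((query, doc), default)]


def mean_simulated_clicks(model, query, ranking, n_sessions, rng):
    """Average number of synthetic clicks a ranking receives per session."""
    total = sum(len(simulate_clicks(model, query, ranking, rng)) for _ in range(n_sessions))
    return total / n_sessions


def kendall_tau(order_a, order_b):
    """Kendall's tau between two orderings of the same items (no ties)."""
    pos_b = {item: i for i, item in enumerate(order_b)}
    concordant = discordant = 0
    for x, y in combinations(order_a, 2):  # x precedes y in order_a
        if pos_b[x] < pos_b[y]:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Hypothetical toy log: (query, ranking shown, set of clicked documents).
sessions = [
    ("q1", ["d1", "d2", "d3"], {"d1"}),
    ("q1", ["d1", "d2", "d3"], {"d1", "d2"}),
    ("q1", ["d2", "d1", "d4"], {"d1"}),
    ("q1", ["d3", "d4", "d1"], set()),
]
model = fit_dctr(sessions)

# Hypothetical systems whose relative quality is assumed to be known in advance.
systems = {
    "sys_a": ["d1", "d2"],
    "sys_b": ["d2", "d3"],
    "sys_c": ["d3", "d4"],
}
known_order = ["sys_a", "sys_b", "sys_c"]

rng = random.Random(0)  # fixed seed so the simulation is repeatable
scores = {
    name: mean_simulated_clicks(model, "q1", ranking, 1000, rng)
    for name, ranking in systems.items()
}
predicted_order = sorted(scores, key=scores.get, reverse=True)

print(predicted_order, kendall_tau(known_order, predicted_order))
```

Because a document-based CTR model ignores rank position, the hypothetical systems here differ in which documents they return rather than in how they order a shared set. A position-aware click model would be needed to separate systems that only reorder the same documents.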
