TL;DR
This paper introduces TripJudge, a new relevance judgement test collection for TripClick health retrieval, addressing biases and coverage issues in previous click-based datasets, and demonstrating its impact on system evaluation.
Contribution
The paper presents TripJudge, a novel, human-annotated relevance test collection for TripClick, improving reliability and coverage over previous click-based datasets.
Findings
TripJudge improves relevance assessment quality.
Evaluation results differ significantly between click-based and judgement-based methods.
TripJudge enhances the reliability of health retrieval system evaluation.
Abstract
Robust test collections are crucial for Information Retrieval research. Recently there has been growing interest in evaluating retrieval systems on domain-specific retrieval tasks; however, these tasks often lack a reliable test collection with human-annotated relevance assessments following the Cranfield paradigm. In the medical domain, the TripClick collection was recently proposed, which contains click log data from the Trip search engine and includes two click-based test sets. However, the clicks are biased towards the retrieval model used, which remains unknown, and a previous study shows that the test sets have low judgement coverage for the Top-10 results of lexical and neural retrieval models. In this paper we present TripJudge, a novel relevance judgement test collection for TripClick health retrieval. We collect relevance judgements in an annotation campaign and ensure the quality…
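The "judgement coverage for the Top-10 results" mentioned in the abstract can be illustrated with a minimal sketch. It assumes coverage means the fraction of a system's top-k documents that carry any relevance label, averaged over queries; the paper's exact definition may differ. The function name, the data layout (`run`, `qrels`), and the toy identifiers are hypothetical and are not taken from TripClick or TripJudge.

```python
from typing import Dict, List, Set


def judgement_coverage_at_k(
    run: Dict[str, List[str]],
    qrels: Dict[str, Set[str]],
    k: int = 10,
) -> float:
    """Mean fraction of top-k retrieved documents that have any judgement.

    run:   query id -> document ids in ranked order (hypothetical layout)
    qrels: query id -> set of judged document ids, relevant or not
    """
    per_query = []
    for qid, ranked_docs in run.items():
        top_k = ranked_docs[:k]
        if not top_k:
            continue  # skip queries for which the system returned nothing
        judged = qrels.get(qid, set())
        per_query.append(sum(doc in judged for doc in top_k) / len(top_k))
    return sum(per_query) / len(per_query) if per_query else 0.0


# Toy data, invented purely for illustration.
run = {"q1": ["d1", "d2", "d3", "d4"], "q2": ["d5", "d6"]}
qrels = {"q1": {"d1", "d3"}, "q2": {"d5"}}
coverage = judgement_coverage_at_k(run, qrels, k=2)
```

Low coverage means many top-ranked documents are unjudged, and most evaluation measures then treat them as non-relevant by default, which can penalise systems that retrieve documents the original click logs never surfaced.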
