Expanding Relevance Judgments for Medical Case-based Retrieval Task with Multimodal LLMs
Catarina Pires, Sérgio Nunes, Luís Filipe Teixeira

TL;DR
This paper demonstrates that Multimodal Large Language Models can significantly expand relevance judgments in medical retrieval tasks, reducing manual effort while maintaining substantial agreement with human assessments.
Contribution
The study introduces a method that uses an MLLM to automatically generate large-scale relevance judgments, greatly increasing the dataset size while reaching a Cohen's Kappa of 0.6 against human labels.
Findings
MLLM-based judgments achieved a Cohen's Kappa of 0.6 against human judgments.
Expanded the dataset more than 37-fold, from 15,028 to 558,653 judgments.
Approximately 99% of new annotations were non-relevant, reflecting domain sparsity.
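To make the two headline numbers concrete, here is a minimal sketch of how Cohen's Kappa is computed for binary relevance labels and how the expansion factor follows from the reported counts. The function name and the toy `human` and `model` label lists are illustrative assumptions, not data or code from the paper; only the two judgment counts come from the findings above.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's Kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    categories = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both raters used a single identical label; agreement is trivial.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy labels (hypothetical, for illustration only): 1 = relevant, 0 = non-relevant.
human = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
model = [1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
kappa = cohens_kappa(human, model)

# Expansion factor from the counts reported in the findings.
original_judgments = 15_028
expanded_judgments = 558_653
expansion_factor = expanded_judgments / original_judgments

print(f"toy kappa = {kappa:.3f}")
print(f"expansion factor = {expansion_factor:.2f}x")
```

Because Kappa corrects for chance agreement, it is more informative than raw accuracy when one class dominates, as is the case here with roughly 99% of the new annotations being non-relevant.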
Abstract
Evaluating Information Retrieval (IR) systems relies on high-quality manual relevance judgments (qrels), which are costly and time-consuming to obtain. While pooling reduces the annotation effort, it results in only partially labeled datasets. Large Language Models (LLMs) offer a promising alternative to reducing reliance on manual judgments, particularly in complex domains like medical case-based retrieval, where relevance assessment requires analyzing both textual and visual information. In this work, we explore using a Multimodal Large Language Model (MLLM) to expand relevance judgments, creating a new dataset of automated judgments. Specifically, we employ Gemini 1.5 Pro on the ImageCLEFmed 2013 case-based retrieval task, simulating human assessment through an iteratively refined, structured prompting strategy that integrates binary scoring, instruction-based evaluation, and…
