Overview of the ClinIQLink 2025 Shared Task on Medical Question-Answering
Brandon Colelough, Davis Bartels, and Dina Demner-Fushman

TL;DR
This paper overviews ClinIQLink 2025, a shared task evaluating large language models on diverse medical question-answering formats with expert-verified data and automated and human scoring methods.
Contribution
It introduces a comprehensive medical question-answering benchmark with diverse formats and a dual evaluation approach combining automated metrics and expert review.
Findings
Large language models are tested on 4,978 medical QA pairs.
Automated scoring uses exact match and embedding metrics.
A physician panel provides expert validation of the top model responses.
Abstract
In this paper, we present an overview of ClinIQLink, a shared task, collocated with the 24th BioNLP workshop at ACL 2025, designed to stress-test large language models (LLMs) on medically-oriented question answering aimed at the level of a General Practitioner. The challenge supplies 4,978 expert-verified, medical source-grounded question-answer pairs that cover seven formats: true/false, multiple choice, unordered list, short answer, short-inverse, multi-hop, and multi-hop-inverse. Participating systems, bundled in Docker or Apptainer images, are executed on the CodaBench platform or the University of Maryland's Zaratan cluster. An automated harness (Task 1) scores closed-ended items by exact match and open-ended items with a three-tier embedding metric. A subsequent physician panel (Task 2) audits the top model responses.
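The exact-match scoring of closed-ended items mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the harness's normalization rules, so the lowercasing and punctuation stripping below are assumptions, and the function names `normalize`, `exact_match`, and `exact_match_accuracy` are hypothetical rather than taken from the ClinIQLink code. The three-tier embedding metric for open-ended items is not specified here and is deliberately left out.

```python
import string


def normalize(answer: str) -> str:
    """Lowercase, trim surrounding whitespace/punctuation, collapse inner spaces.

    Assumed normalization; the actual harness may normalize differently.
    """
    cleaned = answer.strip().lower().strip(string.punctuation + " ")
    return " ".join(cleaned.split())


def exact_match(prediction: str, gold: str) -> float:
    """Return 1.0 if the normalized prediction equals the normalized gold answer."""
    return 1.0 if normalize(prediction) == normalize(gold) else 0.0


def exact_match_accuracy(predictions: list[str], golds: list[str]) -> float:
    """Mean exact-match score over paired predictions and gold answers."""
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    if not golds:
        return 0.0
    total = sum(exact_match(p, g) for p, g in zip(predictions, golds))
    return total / len(golds)


if __name__ == "__main__":
    # Hypothetical true/false and multiple-choice answers, for illustration only.
    preds = ["True.", "b", "C"]
    golds = ["true", "B", "D"]
    print(exact_match_accuracy(preds, golds))
```

Under this assumed normalization, a model answering "True." is credited against a gold label of "true", while any answer naming a different option scores zero.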
Taxonomy
Topics: Topic Modeling · Expert finding and Q&A systems · Artificial Intelligence in Healthcare and Education
