Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering

Rezarta Islamaj; Joey Chan; Robert Leaman; Jongmyung Jung; Hyeongsoon Hwang; Quoc-An Nguyen; Hoang-Quynh Le; Harikrishnan Gurushankar Saisudha; Ganesh Chandrasekar; Rustam R. Taktashov; Nadezhda Yu. Bizyukova; Sofia I. R. Conceição; Paulo R. C. Lopes; Reem Abdel Salam; Mary Adewunmi; Zhiyong Lu

arXiv:2605.12313·cs.CL·May 13, 2026

TL;DR

The MedHopQA shared task benchmarks multi-hop biomedical question answering using a novel dataset, highlighting the importance of retrieval strategies and concept-level evaluation for improving system performance.

Contribution

Introduced a new dataset and evaluation framework for multi-hop biomedical QA, demonstrating the effectiveness of retrieval-augmented methods and concept-level scoring.

Findings

01. The top-ranked system achieved an 89.30% F1 score on the MedCPT metric.

02. Retrieval-augmented generation was crucial for high performance.

03. Concept-level evaluation improved answer assessment.

Abstract

Multi-hop question answering (QA) remains a significant challenge in the biomedical domain, requiring systems to integrate information across multiple sources to answer complex questions. To address this problem, the BioCreative IX MedHopQA shared task was designed to benchmark multi-hop reasoning in large language models (LLMs). We developed a novel dataset of 1,000 challenging QA pairs spanning diseases, genes, and chemicals, with particular emphasis on rare diseases. Each question was constructed to require two-hop reasoning through the integration of information from two distinct Wikipedia pages. The challenge attracted 48 submissions from 13 teams. Systems were evaluated using both surface string comparison and conceptual accuracy (MedCPT score). The results showed a substantial performance gap between baseline LLMs and enhanced systems. The top-ranked submission achieved an…
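The difference between surface string comparison and concept-level scoring mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the track's official scorer and does not use MedCPT: the normalization rules, the `CONCEPT_IDS` synonym table, and its identifiers (`C1`, `C2`) are hypothetical placeholders chosen only to show why a correct synonym can fail an exact-match check yet pass a concept-level one.

```python
import re
import string

def normalize(text: str) -> str:
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction: str, gold: str) -> bool:
    """Surface string comparison after light normalization."""
    return normalize(prediction) == normalize(gold)

# Hypothetical synonym table: surface form -> made-up concept identifier.
# A real system would resolve concepts with a proper terminology or model.
CONCEPT_IDS = {
    "cystic fibrosis": "C1",
    "mucoviscidosis": "C1",
    "cftr": "C2",
}

def concept_match(prediction: str, gold: str, concept_ids=CONCEPT_IDS) -> bool:
    """True when both answers resolve to the same (hypothetical) concept."""
    pred_id = concept_ids.get(normalize(prediction))
    gold_id = concept_ids.get(normalize(gold))
    return pred_id is not None and pred_id == gold_id

def accuracy(pairs, scorer) -> float:
    """Fraction of (prediction, gold) pairs accepted by the given scorer."""
    pairs = list(pairs)
    if not pairs:
        return 0.0
    return sum(scorer(p, g) for p, g in pairs) / len(pairs)

if __name__ == "__main__":
    examples = [
        ("Mucoviscidosis", "Cystic fibrosis"),  # synonym: fails EM only
        ("CFTR", "CFTR"),                       # identical: passes both
        ("asthma", "cystic fibrosis"),          # wrong: fails both
    ]
    print("exact match:", accuracy(examples, exact_match))
    print("concept match:", accuracy(examples, concept_match))
```

On the three toy pairs, the synonym answer is rejected by the string comparison but accepted once both answers are mapped to the same concept, which is the kind of gap a concept-level metric is meant to close.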
