MedHopQA: A Disease-Centered Multi-Hop Reasoning Benchmark and Evaluation Framework for LLM-Based Biomedical Question Answering

Rezarta Islamaj; Robert Leaman; Joey Chan; Nicholas Wan; Qiao Jin; Natalie Xie; John Wilbur; Shubo Tian; Lana Yeganova; Po-Ting Lai; Chih-Hsuan Wei; Yifan Yang; Yao Ge; Qingqing Zhu; Zhizheng Wang; and Zhiyong Lu

arXiv:2605.12361 · cs.CL · May 13, 2026

TL;DR

MedHopQA is a new biomedical multi-hop reasoning benchmark with 1,000 expert-curated questions requiring synthesis across Wikipedia articles, designed to evaluate LLM reasoning capabilities while minimizing gaming and contamination.

Contribution

The paper introduces a disease-centered multi-hop reasoning benchmark, built through a structured construction process, and offers a framework for developing future biomedical QA datasets.

Findings

01. Benchmark emphasizes reasoning over pattern matching.

02. Questions require synthesis from two Wikipedia articles.

03. Framework supports contamination-resistant dataset creation.

Abstract

Evaluating large language models (LLMs) in the biomedical domain requires benchmarks that can distinguish reasoning from pattern matching and remain discriminative as model capabilities improve. Existing biomedical question answering (QA) benchmarks are limited in this respect. Multiple-choice formats can allow models to succeed through answer elimination rather than inference, while widely circulated exam-style datasets are increasingly vulnerable to performance saturation and training data contamination. Multi-hop reasoning, defined as the ability to integrate information across multiple sources to derive an answer, is central to clinically meaningful tasks such as diagnostic support, literature-based discovery, and hypothesis generation, yet remains underrepresented in current biomedical QA benchmarks. MedHopQA is a disease-centered multi-hop reasoning benchmark consisting of 1,000…
