EpiQAL: Benchmarking Large Language Models in Epidemiological Question Answering for Enhanced Alignment and Reasoning

Mingyang Wei; Dehai Min; Zewen Liu; Yuzhang Xie; Guanchen Wu; Ziyang Zhang; Carl Yang; Max S. Y. Lau; Qi He; Lu Cheng; Wei Jin

arXiv:2601.03471 · cs.CL · March 19, 2026

Open Access

TL;DR

EpiQAL is a new benchmark designed to evaluate large language models' ability to perform epidemiological reasoning, highlighting current limitations and the need for improved inference capabilities in health-related AI systems.

Contribution

This paper introduces EpiQAL, the first diagnostic benchmark for epidemiological question answering across diverse diseases. Its three subsets progressively test factual recall, multi-step inference, and conclusion reconstruction under incomplete information.

Findings

01. Current LLMs perform poorly on epidemiological reasoning tasks.

02. Multi-step inference is the most challenging aspect for models.

03. Scaling models alone does not improve epidemiological reasoning performance.

Abstract

Reliable epidemiological reasoning requires synthesizing study evidence to infer disease burden, transmission dynamics, and intervention effects at the population level. Existing medical question answering benchmarks primarily emphasize clinical knowledge or patient-level reasoning, yet few systematically evaluate evidence-grounded epidemiological inference. We present EpiQAL, the first diagnostic benchmark for epidemiological question answering across diverse diseases, comprising three subsets built from open-access literature. The three subsets progressively test factual recall, multi-step inference, and conclusion reconstruction under incomplete information, and are constructed through a quality-controlled pipeline combining taxonomy guidance, multi-model verification, and difficulty screening. Experiments on fourteen models spanning open-source and proprietary systems reveal that…

Taxonomy

Topics: Topic Modeling · Machine Learning in Healthcare · Multimodal Machine Learning Applications