BioPulse-QA: A Dynamic Biomedical Question-Answering Benchmark for Evaluating Factuality, Robustness, and Bias in Large Language Models

Kriti Bhattarai; Vipina K. Keloth; Donald Wright; Andrew Loza; Yang Ren; Hua Xu

arXiv:2601.12632·cs.CL·January 21, 2026

Open Access

TL;DR

BioPulse-QA is a new dynamic benchmark for evaluating biomedical large language models on recent, real-world documents, emphasizing factuality, robustness, and bias, to better reflect clinical application challenges.

Contribution

The paper introduces BioPulse-QA, an up-to-date biomedical QA benchmark with expert-verified questions in both extractive and abstractive formats, designed to evaluate LLM performance on recent biomedical texts.

Findings

01. GPT-o1 achieves the highest relaxed F1 score, 0.92, on drug labels.

02. Clinical trial questions are the most challenging, with F1 scores as low as 0.36.

03. Bias testing shows negligible differences across demographic groups.
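
This summary reports relaxed F1 scores without defining the metric. As a rough illustration only, the sketch below computes token-overlap F1 between a predicted and a reference answer, a formulation commonly used for extractive QA. The function names `tokenize` and `token_f1`, the lowercase alphanumeric normalization, and the choice of token overlap as the "relaxed" criterion are all assumptions for illustration, not the paper's stated method.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase and keep alphanumeric runs (an assumed normalization)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted answer and a reference answer.

    Hypothetical helper: partial matches earn partial credit, which is one
    possible reading of a "relaxed" score.
    """
    pred_tokens = tokenize(prediction)
    ref_tokens = tokenize(reference)

    # If either side is empty, score 1.0 only when both are empty.
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)

    # Multiset intersection counts each shared token at most as often
    # as it appears in both answers.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up example strings, not taken from the benchmark.
    print(token_f1("10 mg once daily", "10 mg daily"))
```

Under this formulation a prediction containing every reference token plus extra words keeps full recall but loses precision, so it scores below 1.0 rather than being marked wrong outright.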

Abstract

Objective: Large language models (LLMs) are increasingly applied in biomedical settings, and existing benchmark datasets have played an important role in supporting model development and evaluation. However, these benchmarks often have limitations. Many rely on static or outdated datasets that fail to capture the dynamic, context-rich, and high-stakes nature of biomedical knowledge. They also carry increasing risk of data leakage due to overlap with model pretraining corpora and often overlook critical dimensions such as robustness to linguistic variation and potential demographic biases.

Materials and Methods: To address these gaps, we introduce BioPulse-QA, a benchmark that evaluates LLMs on answering questions from newly published biomedical documents including drug labels, trial protocols, and clinical guidelines. BioPulse-QA includes 2,280 expert-verified question answering (QA)…

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Machine Learning in Healthcare