Building a Human-Verified Clinical Reasoning Dataset via a Human LLM Hybrid Pipeline for Trustworthy Medical AI

Chao Ding; Mouxiao Bian; Pengcheng Chen; Hongliang Zhang; Tianbin Li; Lihao Liu; Jiayuan Chen; Zhuoran Li; Yabei Zhong; Yongqi Liu; Haiqing Huang; Dongming Shan; Junjun He; Jie Xu

arXiv:2505.06912 · cs.CV · May 13, 2025

TL;DR

This paper introduces a large, expert-validated clinical reasoning dataset for medical AI, created through a hybrid human-LLM pipeline to enhance transparency and trustworthiness in medical question-answering systems.

Contribution

It presents a novel, large-scale, expert-validated dataset with chain-of-thought explanations, curated via a scalable hybrid pipeline for medical AI research.

Findings

01 The dataset contains 31,247 validated medical QA pairs, each with an expert-validated chain-of-thought explanation.

02 Expert review improves the quality and clinical relevance of LLM-generated rationales.

03 The dataset supports the development of transparent and trustworthy medical AI models.

Abstract

Despite strong performance in medical question-answering, the clinical adoption of Large Language Models (LLMs) is critically hampered by their opaque 'black-box' reasoning, limiting clinician trust. This challenge is compounded by the predominant reliance of current medical LLMs on corpora from scientific literature or synthetic data, which often lack the granular expert validation and high clinical relevance essential for advancing their specialized medical capabilities. To address these critical gaps, we introduce a highly clinically relevant dataset with 31,247 medical question-answer pairs, each accompanied by expert-validated chain-of-thought (CoT) explanations. This resource, spanning multiple clinical domains, was curated via a scalable human-LLM hybrid pipeline: LLM-generated rationales were iteratively reviewed, scored, and refined by medical experts against a structured…
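The review loop described in the abstract (an LLM drafts a rationale, then experts review, score, and refine it over several rounds) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `QARecord` fields, the `review_until_accepted` function, the integer score, the `accept_score` threshold, and the `max_rounds` cap are all hypothetical. The abstract is cut off before it names the structured criteria the experts score against, so the score is treated here as an opaque number returned by a reviewer callback.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class QARecord:
    """One question-answer pair with its chain-of-thought rationale (hypothetical schema)."""
    question: str
    answer: str
    rationale: str                                    # LLM draft, replaced by each refinement
    scores: List[int] = field(default_factory=list)   # one expert score per review round
    validated: bool = False


# Hypothetical reviewer interface: given a record, return (score, revised rationale).
Reviewer = Callable[[QARecord], Tuple[int, str]]


def review_until_accepted(
    record: QARecord,
    reviewer: Reviewer,
    accept_score: int = 4,   # assumed acceptance threshold, not from the paper
    max_rounds: int = 3,     # assumed cap on review rounds, not from the paper
) -> QARecord:
    """Run review rounds until the score reaches the threshold or rounds run out."""
    for _ in range(max_rounds):
        score, revised = reviewer(record)
        record.scores.append(score)
        record.rationale = revised
        if score >= accept_score:
            record.validated = True
            break
    return record
```

In this sketch a record that never reaches the threshold keeps `validated = False`, so it can be filtered out or routed to further review. Whether the actual pipeline drops or escalates such items is not stated in the text above.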

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling