Can Large Language Models Self-Correct in Medical Question Answering? An Exploratory Study

Zaifu Zhan; Mengyuan Cui; Rui Zhang

arXiv:2604.00261·cs.CL·April 3, 2026

TL;DR

This study explores whether self-reflective prompting improves the accuracy of large language models in medical question answering, finding limited and inconsistent benefits across datasets and models.

Contribution

It provides an empirical analysis of self-reflective prompting in medical QA, revealing its variable effectiveness and highlighting its role as an analytical tool rather than a reliability enhancer.

Findings

1. Self-reflective prompting yields modest gains on MedQA.

2. Benefits on HeadQA and PubMedQA are limited or negative.

3. Increasing the number of reflection steps does not necessarily improve performance.

Abstract

Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning. Meanwhile, self-reflective (self-corrective) prompting has been widely claimed to enhance model reliability by prompting LLMs to critique and revise their own reasoning, yet its effectiveness in safety-critical medical settings remains unclear. In this work, we conduct an exploratory analysis of self-reflective reasoning for medical multiple-choice question answering. Using GPT-4o and GPT-4o-mini, we compare standard CoT prompting with an iterative self-reflection loop and track how predictions evolve across reflection steps on three widely used medical QA benchmarks (MedQA, HeadQA, and PubMedQA). We analyze whether self-reflection leads to error correction, error…
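The iterative self-reflection loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `ask` callable, the prompt wording, and the four transition labels are all assumptions made for the example, and the abstract itself does not specify them.

```python
from typing import Callable, List

# Hypothetical interface: takes a prompt string and returns an answer letter.
# In practice this would wrap a call to a model such as GPT-4o.
AnswerFn = Callable[[str], str]


def reflect_loop(question: str, ask: AnswerFn, steps: int) -> List[str]:
    """Return the prediction from the initial CoT pass and from each reflection step."""
    # Initial chain-of-thought pass (prompt wording is illustrative only).
    answer = ask(f"{question}\nThink step by step, then give the answer letter.")
    history = [answer]
    for _ in range(steps):
        # Each reflection step asks the model to critique and possibly revise.
        answer = ask(
            f"{question}\nYour previous answer was {answer}. "
            "Critique your reasoning and give a final answer letter."
        )
        history.append(answer)
    return history


def classify_transition(history: List[str], gold: str) -> str:
    """Label how the prediction changed between the first and last step (assumed labels)."""
    first_ok = history[0] == gold
    last_ok = history[-1] == gold
    if first_ok and last_ok:
        return "stays_correct"
    if not first_ok and last_ok:
        return "corrected"
    if first_ok and not last_ok:
        return "introduced_error"
    return "stays_incorrect"


if __name__ == "__main__":
    # Stub model that answers B first, then A on both reflection steps.
    canned = iter(["B", "A", "A"])
    history = reflect_loop("Example question?", lambda prompt: next(canned), steps=2)
    print(history, classify_transition(history, gold="A"))
```

Keeping the full prediction history, instead of only the final answer, is what makes it possible to tell cases where reflection fixes an error apart from cases where it overturns an answer that was initially correct.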
