Can Large Language Models Self-Correct in Medical Question Answering? An Exploratory Study
Zaifu Zhan, Mengyuan Cui, Rui Zhang

TL;DR
This study explores whether self-reflective prompting improves the accuracy of large language models in medical question answering, finding limited and inconsistent benefits across datasets and models.
Contribution
It provides an empirical analysis of self-reflective prompting in medical QA, revealing its variable effectiveness and highlighting its role as an analytical tool rather than a reliability enhancer.
Findings
Self-reflective prompting yields modest gains on MedQA.
Benefits on HeadQA and PubMedQA are limited or negative.
Increasing reflection steps does not necessarily improve performance.
Abstract
Large language models (LLMs) have achieved strong performance on medical question answering (medical QA), and chain-of-thought (CoT) prompting has further improved results by eliciting explicit intermediate reasoning. Self-reflective (self-corrective) prompting, which asks an LLM to critique and revise its own reasoning, has been widely claimed to enhance model reliability, yet its effectiveness in safety-critical medical settings remains unclear. In this work, we conduct an exploratory analysis of self-reflective reasoning for medical multiple-choice question answering. Using GPT-4o and GPT-4o-mini, we compare standard CoT prompting with an iterative self-reflection loop and track how predictions evolve across reflection steps on three widely used medical QA benchmarks (MedQA, HeadQA, and PubMedQA). We analyze whether self-reflection leads to error correction, error…
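The iterative self-reflection loop and the tracking of predictions across reflection steps described in the abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: the `ask` callable stands in for any model call, the prompt wording is invented, and the four transition labels are hypothetical names chosen for this example.

```python
from typing import Callable, List


def reflection_trajectory(question: str, ask: Callable[[str], str], steps: int) -> List[str]:
    """Return the answer from an initial CoT pass, then one answer per reflection step.

    `ask` is a hypothetical stand-in for a model call: it takes a prompt
    and returns a single option letter. The prompt text is an assumption.
    """
    answers = [ask(f"{question}\nThink step by step, then give one option letter.")]
    for _ in range(steps):
        prompt = (
            f"{question}\nYour previous answer was {answers[-1]}. "
            "Critique your reasoning, then give one final option letter."
        )
        answers.append(ask(prompt))
    return answers


def classify_transition(first: str, last: str, gold: str) -> str:
    """Label how reflection changed the outcome (label names are hypothetical)."""
    if first == gold:
        return "stays_correct" if last == gold else "introduced_error"
    return "corrected" if last == gold else "stays_incorrect"
```

For example, with a scripted fake model that answers B, then A, then A, `reflection_trajectory(question, ask, steps=2)` yields the trajectory B, A, A, and `classify_transition` labels it `corrected` when the gold answer is A. Comparing the first and last entries of each trajectory is one simple way to count how often reflection helps versus harms.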
