Can large language models reason about medical questions?
Valentin Liévin, Christoffer Egeberg Hother, Andreas Geert Motzfeldt, Ole Winther

TL;DR
This study evaluates large language models' ability to reason about complex medical questions, demonstrating that with advanced prompting techniques, models like GPT-3.5 can achieve passing scores on medical benchmarks, with open-source models closing the gap.
Contribution
It provides a comprehensive assessment of LLMs' reasoning in medical domains using various prompting strategies and shows open-source models' potential to match proprietary models.
Findings
GPT-3.5 achieves passing scores on three medical benchmarks.
Prompt engineering significantly improves model performance.
Open-source Llama-2 70B passes MedQA-USMLE with 62.5% accuracy.
Abstract
Although large language models (LLMs) often produce impressive outputs, it remains unclear how they perform in real-world scenarios requiring strong reasoning skills and expert domain knowledge. We set out to investigate whether closed- and open-source models (GPT-3.5, Llama-2, etc.) can be applied to answer and reason about difficult real-world questions. We focus on three popular medical benchmarks (MedQA-USMLE, MedMCQA, and PubMedQA) and multiple prompting scenarios: Chain-of-Thought (CoT, think step-by-step), few-shot and retrieval augmentation. Based on an expert annotation of the generated CoTs, we found that InstructGPT can often read, reason and recall expert knowledge. Last, by leveraging advances in prompt engineering (few-shot and ensemble methods), we demonstrated that GPT-3.5 not only yields calibrated predictive distributions, but also reaches the passing score on…
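The ensemble step mentioned in the abstract can be illustrated with a minimal sketch. It assumes the answer letter has already been extracted from each sampled CoT completion; the function name `ensemble_vote`, the four-option format, and the sample data are hypothetical and are not taken from the paper's code. The vote share per option serves as a rough predictive distribution, and the most frequent option is the prediction.

```python
from collections import Counter


def ensemble_vote(sampled_answers, options=("A", "B", "C", "D")):
    """Aggregate answers from several sampled CoT completions.

    Returns the majority answer and the vote share of each option,
    which can be read as a rough predictive distribution.
    """
    # Ignore anything that is not a valid option (e.g. a failed extraction).
    counts = Counter(a for a in sampled_answers if a in options)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no valid answers to aggregate")
    distribution = {opt: counts[opt] / total for opt in options}
    # Ties are broken by option order, an arbitrary choice made for this sketch.
    prediction = max(options, key=lambda opt: distribution[opt])
    return prediction, distribution


# Hypothetical answers extracted from five sampled completions.
samples = ["B", "B", "C", "B", "A"]
prediction, distribution = ensemble_vote(samples)
print(prediction, distribution)
```

Comparing these vote shares with the observed accuracy at each confidence level is one way to check whether such a distribution is calibrated.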
Taxonomy
Topics: Topic Modeling · Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Byte Pair Encoding · Adam · Residual Connection · Attention Dropout
