Assessing Large Language Models for Medical QA: Zero-Shot and LLM-as-a-Judge Evaluation

Shefayat E Shams Adib; Ahmed Alfey Sani; Ekramul Alam Esham; Ajwad Abrar; Tareque Mohmud Chowdhury

arXiv:2602.14564·cs.CL·February 17, 2026

Open Access

TL;DR

This study evaluates five large language models on medical question-answering tasks using zero-shot methods, demonstrating that larger models generally perform better and highlighting the potential for practical deployment in healthcare settings.

Contribution

It introduces a standardized benchmark for evaluating LLMs in medical QA without fine-tuning, comparing multiple models and analyzing their performance in clinical contexts.

Findings

01 Larger models outperform smaller ones in medical QA tasks.

02 Llama-4-Maverick-17B shows competitive results with efficiency benefits.

03 Model performance improves with increased size, supporting scaling benefits in clinical NLP.

Abstract

Recently, Large Language Models (LLMs) have gained significant traction in the medical domain, especially for building medical QA systems that improve access to healthcare in low-resource settings. This paper compares five LLMs deployed between April 2024 and August 2025 for medical QA, using the iCliniq dataset, which contains 38,000 medical questions and answers across diverse specialties. The models are Llama-3-8B-Instruct, Llama 3.2 3B, Llama 3.3 70B Instruct, Llama-4-Maverick-17B-128E-Instruct, and GPT-5-mini. We use a zero-shot evaluation methodology with BLEU and ROUGE metrics to measure performance without specialized fine-tuning. Our results show that larger models such as Llama 3.3 70B Instruct outperform smaller models, consistent with observed scaling benefits in clinical tasks. Notably, Llama-4-Maverick-17B exhibited more competitive results,…
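To make the overlap metrics named in the abstract concrete, here is a minimal sketch of the n-gram counting that underlies BLEU and ROUGE. It is a simplified illustration, not the authors' evaluation code: the function names (`tokenize`, `ngrams`, `clipped_precision`, `rouge_n_f1`), the whitespace tokenizer, and the example sentences are all assumptions, and full BLEU additionally combines several n-gram orders with a brevity penalty.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real pipelines
    # usually apply a proper tokenizer and may stem words.
    return text.lower().split()


def ngrams(tokens, n):
    # Count every contiguous n-gram in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def clipped_precision(candidate, reference, n=1):
    # Clipped n-gram precision, the building block of BLEU:
    # each candidate n-gram is credited at most as often as it
    # appears in the reference.
    cand = ngrams(tokenize(candidate), n)
    ref = ngrams(tokenize(reference), n)
    total = sum(cand.values())
    if total == 0:
        return 0.0
    return sum((cand & ref).values()) / total


def rouge_n_f1(candidate, reference, n=1):
    # ROUGE-N reported as F1: harmonic mean of n-gram precision and recall.
    cand = ngrams(tokenize(candidate), n)
    ref = ngrams(tokenize(reference), n)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference answer and model output, for illustration only.
reference = "take rest and drink plenty of fluids"
candidate = "drink plenty of fluids and rest"

unigram_f1 = rouge_n_f1(candidate, reference, n=1)
bigram_precision = clipped_precision(candidate, reference, n=2)
```

In a zero-shot setup, `candidate` would be the model's answer to a question with no task-specific fine-tuning, and `reference` the corresponding answer from the dataset; scores would then be averaged over all question-answer pairs.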

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Topic Modeling · Machine Learning in Healthcare