Domain Fine-Tuning vs. Retrieval-Augmented Generation for Medical Multiple-Choice Question Answering: A Controlled Comparison at the 4B-Parameter Scale
Avi-ad Avraam Buskila

TL;DR
This study compares domain fine-tuning and retrieval-augmented generation (RAG) for medical multiple-choice question answering at the 4B-parameter scale. Domain fine-tuning raises majority-vote accuracy by 6.8 percentage points, while RAG does not significantly improve performance.
Contribution
The paper provides a controlled comparison showing that domain fine-tuning outperforms retrieval-augmented generation for medical QA at the 4B-parameter scale.
Findings
Domain fine-tuning improves majority-vote accuracy by 6.8 percentage points.
Retrieval-augmented generation does not significantly improve performance.
At this scale, domain knowledge encoded in model weights is more effective than knowledge supplied through in-context retrieval.
Abstract
Practitioners deploying small open-weight large language models (LLMs) for medical question answering face a recurring design choice: invest in a domain-fine-tuned model, or keep a general-purpose model and inject domain knowledge at inference time via retrieval-augmented generation (RAG). We isolate this trade-off by holding model size, prompt template, decoding temperature, retrieval pipeline, and evaluation protocol fixed, and varying only (i) whether the model has been domain-adapted (Gemma 3 4B vs. MedGemma 4B, both 4-bit quantized and served via Ollama) and (ii) whether retrieved passages from a medical knowledge corpus are inserted into the prompt. We evaluate all four cells of this 2x2 design on the full MedQA-USMLE 4-option test split (1,273 questions) with three repetitions per question (15,276 LLM calls). Domain fine-tuning yields a +6.8 percentage-point gain in majority-vote…
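The 2x2 design and the majority-vote metric described in the abstract can be illustrated with a minimal Python sketch. It is not the author's code: the model labels, function names, and tie-breaking rule (first answer among the tied options wins) are assumptions made for illustration, since the abstract does not specify them. Only the counts (1,273 questions, three repetitions, four cells) come from the abstract.

```python
from collections import Counter
from itertools import product

# Hypothetical labels for the two factors of the 2x2 design.
MODELS = ["gemma3-4b", "medgemma-4b"]   # general-purpose vs. domain-adapted
RETRIEVAL = [False, True]               # without vs. with retrieved passages

N_QUESTIONS = 1273  # MedQA-USMLE 4-option test split (from the abstract)
N_REPS = 3          # repetitions per question (from the abstract)


def design_cells():
    """Enumerate the four (model, retrieval) cells of the 2x2 design."""
    return list(product(MODELS, RETRIEVAL))


def total_calls(n_questions=N_QUESTIONS, n_reps=N_REPS):
    """Total LLM calls: cells x questions x repetitions."""
    return len(design_cells()) * n_questions * n_reps


def majority_vote(answers):
    """Return the most frequent answer among the repetitions.

    Assumption: ties are broken in favor of the answer that appears
    first; the abstract does not state the tie-breaking rule.
    """
    counts = Counter(answers)
    top = max(counts.values())
    for answer in answers:
        if counts[answer] == top:
            return answer


def majority_vote_accuracy(predictions, gold):
    """Fraction of questions whose majority-vote answer matches the gold label."""
    correct = sum(majority_vote(reps) == g for reps, g in zip(predictions, gold))
    return correct / len(gold)


if __name__ == "__main__":
    print(design_cells())
    print(total_calls())
```

With the numbers from the abstract, `total_calls()` evaluates to 4 × 1,273 × 3 = 15,276, matching the reported number of LLM calls.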
