A Multi-Agent Framework for Medical AI: Leveraging Fine-Tuned GPT, LLaMA, and DeepSeek R1 for Evidence-Based and Bias-Aware Clinical Query Processing

Naeimeh Nourmohammadi, Md Meem Hossain, The Anh Han, Safina Showkat Ara, and Zia Ush Shamszaman

arXiv:2602.14158 · cs.CL · February 17, 2026

Open Access

TL;DR

This paper introduces a multi-agent framework combining fine-tuned LLMs, evidence retrieval, and bias detection to improve the reliability and safety of AI-driven medical question answering systems.

Contribution

It presents a novel multi-agent architecture that integrates specialized LLMs with evidence grounding and bias checks for clinical QA, enhancing answer accuracy and trustworthiness.

Findings

01

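DeepSeek R1 achieves the strongest generation scores among the fine-tuned models (ROUGE-1 0.536, ROUGE-2 0.226, BLEU 0.098) and substantially outperforms BioGPT, a specialised biomedical baseline evaluated zero-shot.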

02

The full system achieves 87% accuracy in clinical QA.

03

Evidence augmentation reduces response uncertainty.

Abstract

Large language models (LLMs) show promise for healthcare question answering, but clinical use is limited by weak verification, insufficient evidence grounding, and unreliable confidence signalling. We propose a multi-agent medical QA framework that combines complementary LLMs with evidence retrieval, uncertainty estimation, and bias checks to improve answer reliability. Our approach has two phases. First, we fine-tune three representative LLM families (GPT, LLaMA, and DeepSeek R1) on MedQuAD-derived medical QA data (20k+ question-answer pairs across multiple NIH domains) and benchmark generation quality. DeepSeek R1 achieves the strongest scores (ROUGE-1 0.536 ± 0.04; ROUGE-2 0.226 ± 0.03; BLEU 0.098 ± 0.018) and substantially outperforms the specialised biomedical baseline BioGPT in zero-shot evaluation. Second, we implement a modular multi-agent pipeline in which a Clinical…
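For readers unfamiliar with the ROUGE-N scores quoted in the abstract, the following is a minimal sketch of how an n-gram overlap F1 score can be computed between a reference answer and a generated answer. It assumes lowercased whitespace tokenisation and the F1 variant of the metric; the paper's actual evaluation tooling, tokenisation, and ROUGE variant are not stated in this excerpt, and the function name `rouge_n_f1` is hypothetical.

```python
from collections import Counter


def rouge_n_f1(reference: str, candidate: str, n: int = 1) -> float:
    """Illustrative ROUGE-N F1 (assumption: whitespace tokens, lowercased)."""

    def ngrams(text: str) -> Counter:
        tokens = text.lower().split()
        # Count every contiguous n-token window; empty if the text is too short.
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

    ref_counts = ngrams(reference)
    cand_counts = ngrams(candidate)

    # Clipped overlap: each n-gram counts at most as often as it occurs in both texts.
    overlap = sum((ref_counts & cand_counts).values())
    if overlap == 0:
        return 0.0

    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy sentences for illustration only, not data from the paper.
    reference = "the cat sat on the mat"
    candidate = "the cat lay on the mat"
    print(rouge_n_f1(reference, candidate, n=1))
    print(rouge_n_f1(reference, candidate, n=2))
```

Under these assumptions, ROUGE-1 rewards shared vocabulary while ROUGE-2 also requires matching word order, which is one reason ROUGE-2 values are typically lower than ROUGE-1 values for the same outputs.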

Taxonomy

Topics: Artificial Intelligence in Healthcare and Education · Machine Learning in Healthcare · Topic Modeling