Architecting Clinical Collaboration: Multi-Agent Reasoning Systems for Multimodal Medical VQA

Karishma Thakrar; Shreyas Basavatia; Akshay Daftardar

arXiv:2507.05520·cs.AI·August 27, 2025


TL;DR

This paper evaluates multi-agent reasoning systems for medical visual question answering in dermatology telemedicine. It reports that clinically inspired architectures outperform fine-tuned models while producing accurate, explainable, and literature-grounded diagnoses.

Contribution

The paper introduces a multi-agent reasoning approach that mimics clinical collaboration, improving accuracy and explainability over traditional fine-tuning in medical VQA.

Findings

01. Clinically inspired architectures achieved up to 70% accuracy.

02. Fine-tuning degraded performance in most models.

03. Models provided explainable, literature-grounded outputs.

Abstract

Dermatological care via telemedicine often lacks the rich context of in-person visits. Clinicians must make diagnoses based on a handful of images and brief descriptions, without the benefit of physical exams, second opinions, or reference materials. While many medical AI systems attempt to bridge these gaps with domain-specific fine-tuning, this work hypothesized that mimicking clinical reasoning processes could offer a more effective path forward. This study tested seven vision-language models on medical visual question answering across six configurations: baseline models, fine-tuned variants, and both augmented with either reasoning layers that combine multiple model perspectives, analogous to peer consultation, or retrieval-augmented generation that incorporates medical literature at inference time, serving a role similar to reference-checking. While fine-tuning degraded performance…
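The six configurations described in the abstract can be read as a small grid: two model variants (baseline, fine-tuned), each run alone, with a reasoning layer, or with retrieval-augmented generation. The sketch below only enumerates that grid under this reading. The identifiers `MODEL_VARIANTS`, `AUGMENTATIONS`, and `enumerate_configurations` are hypothetical labels chosen for illustration and do not come from the paper.

```python
from itertools import product

# Hypothetical labels: the abstract names these ideas, not these identifiers.
MODEL_VARIANTS = ["baseline", "fine_tuned"]
AUGMENTATIONS = ["none", "reasoning_layer", "rag"]


def enumerate_configurations():
    """Return one readable label per (variant, augmentation) pair."""
    configs = []
    for variant, augmentation in product(MODEL_VARIANTS, AUGMENTATIONS):
        # An unaugmented variant is labeled by its variant name alone.
        if augmentation == "none":
            configs.append(variant)
        else:
            configs.append(f"{variant}+{augmentation}")
    return configs


if __name__ == "__main__":
    for label in enumerate_configurations():
        print(label)
```

Under this assumed reading, the grid yields six labels, matching the count stated in the abstract. How each configuration is applied across the seven vision-language models is not specified in the text available here.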
