Architecting Clinical Collaboration: Multi-Agent Reasoning Systems for Multimodal Medical VQA
Karishma Thakrar, Shreyas Basavatia, Akshay Daftardar

TL;DR
This paper explores multi-agent reasoning systems for medical visual question answering (VQA) and reports that clinically inspired architectures outperform fine-tuned models, producing accurate, explainable, and literature-grounded diagnoses in dermatology telemedicine.
Contribution
It introduces a novel multi-agent reasoning approach that mimics clinical collaboration, improving accuracy and explainability over traditional fine-tuning methods in medical VQA.
Findings
Clinically inspired architectures achieved up to 70% accuracy.
Fine-tuning degraded performance in most models.
Models provided explainable, literature-grounded outputs.
Abstract
Dermatological care via telemedicine often lacks the rich context of in-person visits. Clinicians must make diagnoses based on a handful of images and brief descriptions, without the benefit of physical exams, second opinions, or reference materials. While many medical AI systems attempt to bridge these gaps with domain-specific fine-tuning, this work hypothesized that mimicking clinical reasoning processes could offer a more effective path forward. This study tested seven vision-language models on medical visual question answering across six configurations: baseline models, fine-tuned variants, and both augmented with either reasoning layers that combine multiple model perspectives, analogous to peer consultation, or retrieval-augmented generation that incorporates medical literature at inference time, serving a role similar to reference-checking. While fine-tuning degraded performance…
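The count of six configurations follows from crossing two model variants (baseline, fine-tuned) with three augmentation options (none, reasoning layer, retrieval-augmented generation). The short sketch below only enumerates that grid. The labels and the function name `enumerate_configurations` are illustrative assumptions, not the paper's own naming or code.

```python
from itertools import product

# Illustrative labels (assumed, not taken from the paper's code).
MODEL_VARIANTS = ["baseline", "fine-tuned"]
AUGMENTATIONS = ["none", "reasoning layer", "retrieval-augmented generation"]


def enumerate_configurations():
    """Return every (model variant, augmentation) pairing in the grid."""
    return list(product(MODEL_VARIANTS, AUGMENTATIONS))


configurations = enumerate_configurations()
for variant, augmentation in configurations:
    print(f"{variant} + {augmentation}")
```

Under this reading, each of the seven vision-language models would be evaluated once per pairing, with "none" standing for the unaugmented baseline or fine-tuned model.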
