Evaluating Multi-Agent LLM Architectures for Rare Disease Diagnosis
Ahmed Almasoud

TL;DR
This study evaluates four LLM agent topologies (one single-agent control and three multi-agent designs) for rare disease diagnosis and introduces a Reasoning Gap metric. It finds that the Hierarchical topology slightly outperforms the others and that added architectural complexity does not always improve accuracy.
Contribution
The paper systematically compares agent topologies for rare disease diagnosis and introduces a new metric, the Reasoning Gap, to assess reasoning quality, highlighting how architecture design affects diagnostic accuracy.
Findings
The Hierarchical topology achieves the highest accuracy, 50.0%.
The Adversarial topology reduces accuracy to 27.3%, compared with 48.5% for the single-agent Control.
Multi-agent systems outperform the single-agent Control in the Bone and Thoracic disease categories.
Abstract
While large language models are capable diagnostic tools, the impact of multi-agent topology on diagnostic accuracy remains underexplored. This study evaluates four agent topologies: Control (single agent), Hierarchical, Adversarial, and Collaborative. The evaluation covers 302 cases spanning 33 rare disease categories. We introduce a Reasoning Gap metric to quantify the difference between internal knowledge retrieval and final diagnostic accuracy. Results indicate that the Hierarchical topology (50.0% accuracy) marginally outperforms the Collaborative (49.8%) and Control (48.5%) configurations. In contrast, the Adversarial model significantly degrades performance (27.3%), exhibiting a massive Reasoning Gap where valid diagnoses were rejected due to artificial doubt. Across all architectures, performance was strongest in the Allergic diseases and Toxic Effects categories but poorest in Cardiac Malformation…
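The abstract describes the Reasoning Gap only as the difference between internal knowledge retrieval and final diagnostic accuracy, without giving a formula. The Python sketch below shows one possible reading, not the author's definition. It assumes the gap is the share of cases where the correct diagnosis appeared somewhere in the agents' intermediate reasoning, minus the share where the final answer was correct. The record type, its field names, and the toy cases are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class CaseResult:
    """Hypothetical per-case record; field names are illustrative only."""
    mentioned_in_reasoning: bool  # correct diagnosis appeared in intermediate agent output
    final_correct: bool           # final answer matched the reference diagnosis


def reasoning_gap(results: Sequence[CaseResult]) -> float:
    """Assumed definition: retrieval rate minus final accuracy.

    A large positive value means the system often surfaced the right
    diagnosis internally but did not commit to it in the final answer.
    """
    if not results:
        raise ValueError("at least one case is required")
    n = len(results)
    retrieval_rate = sum(r.mentioned_in_reasoning for r in results) / n
    final_accuracy = sum(r.final_correct for r in results) / n
    return retrieval_rate - final_accuracy


# Toy data for illustration only (not results from the paper):
# the correct diagnosis is surfaced in 3 of 4 cases but kept in only 1.
toy_cases = [
    CaseResult(mentioned_in_reasoning=True, final_correct=True),
    CaseResult(mentioned_in_reasoning=True, final_correct=False),
    CaseResult(mentioned_in_reasoning=True, final_correct=False),
    CaseResult(mentioned_in_reasoning=False, final_correct=False),
]
toy_gap = reasoning_gap(toy_cases)  # 0.75 - 0.25
```

Under this reading, the Adversarial behavior described in the abstract, where valid diagnoses are surfaced and then rejected, would show up as cases with `mentioned_in_reasoning=True` and `final_correct=False`, which widen the gap.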
Taxonomy
Topics: Genomics and Rare Diseases · Explainable Artificial Intelligence (XAI) · Topic Modeling
