A Multi-agent Large Language Model Framework to Automatically Assess Performance of a Clinical AI Triage Tool

Adam E. Flanders; Yifan Peng; Luciano Prevedello; Robyn Ball; Errol Colak; Prahlad Menon; George Shih; Hui-Ming Lin; Paras Lakhani

arXiv:2510.26498 · cs.CL · October 31, 2025

TL;DR

This study demonstrates that an ensemble of multiple large language models can more reliably evaluate the performance of a clinical AI triage tool on head CT exams than individual models, improving consistency in retrospective assessments.

Contribution

Introduces a multi-agent LLM framework that enhances reliability in assessing clinical AI tools compared to single LLM evaluations.

Findings

01. Ensemble of LLMs outperforms individual models in AUC and F1 scores.

02. Medium to large open-source LLM ensembles provide consistent performance evaluation.

03. No significant difference between top-performing ensembles and GPT-4o in MCC.

Abstract

Purpose: To determine whether an ensemble of multiple LLM agents could be used collectively to provide a more reliable assessment of a pixel-based AI triage tool than a single LLM. Methods: A total of 29,766 non-contrast CT head exams from fourteen hospitals were processed by a commercial intracranial hemorrhage (ICH) AI detection tool. Radiology reports were analyzed by an ensemble of eight open-source LLMs and a HIPAA-compliant internal version of GPT-4o, using a single multi-shot prompt that assessed for the presence of ICH. A total of 1,726 examples were manually reviewed. Performance characteristics of the eight open-source models and their consensus were compared to GPT-4o. Three ideal consensus LLM ensembles were tested for rating the performance of the triage tool. Results: The cohort consisted of 29,766 head CT exam-report pairs. The highest AUC performance was achieved with…
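The consensus idea in the Methods can be illustrated with a minimal sketch. It assumes a simple majority vote over per-report binary ICH labels; the abstract as shown here does not state the actual consensus rule, so this choice is an assumption. The function names and the toy data are hypothetical and are not taken from the paper. The sketch derives a consensus label per report, then scores the triage tool's flags against that consensus using the Matthews correlation coefficient (MCC).

```python
from math import sqrt
from typing import Sequence, Tuple


def majority_vote(votes: Sequence[int]) -> int:
    """Return 1 (ICH present) if more than half of the votes are 1, else 0.

    Ties resolve to 0; this tie rule is an assumption of the sketch.
    """
    return int(2 * sum(votes) > len(votes))


def confusion_counts(predicted: Sequence[int],
                     reference: Sequence[int]) -> Tuple[int, int, int, int]:
    """Count (tp, tn, fp, fn) for binary labels of equal length."""
    tp = sum(1 for p, r in zip(predicted, reference) if p == 1 and r == 1)
    tn = sum(1 for p, r in zip(predicted, reference) if p == 0 and r == 0)
    fp = sum(1 for p, r in zip(predicted, reference) if p == 1 and r == 0)
    fn = sum(1 for p, r in zip(predicted, reference) if p == 0 and r == 1)
    return tp, tn, fp, fn


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews correlation coefficient; 0.0 when the denominator is zero."""
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy data (hypothetical): one row per report, one column per LLM vote.
llm_votes = [
    [1, 1, 0],
    [0, 0, 0],
    [1, 0, 1],
    [0, 1, 0],
]
consensus = [majority_vote(row) for row in llm_votes]

# Hypothetical triage-tool output for the same four exams.
triage_tool_flags = [1, 0, 0, 0]

tp, tn, fp, fn = confusion_counts(triage_tool_flags, consensus)
tool_mcc = mcc(tp, tn, fp, fn)
print(consensus, (tp, tn, fp, fn), round(tool_mcc, 3))
```

Under this scheme the consensus labels stand in for a reference standard derived from the radiology reports. Other aggregation rules (weighted votes, unanimity thresholds) would slot into `majority_vote` without changing the scoring step.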
