# A comparative accuracy study of multimodal LLMs, VLM and agent-based framework for pulmonary nodule detection on chest radiographs

**Authors:** Daria Khovanova, Yuriy Vasilev, Anton Vladzymyrskyy, Olga Omelyanskaya, Anastasia Pamova, Kirill Arzamasov

PMC · DOI: 10.3389/fdgth.2025.1674835 · Frontiers in Digital Health · 2026-01-27

## TL;DR

This study compares the accuracy of nine AI models (multimodal LLMs, vision-language models, and an agent-based framework) in detecting pulmonary nodules on chest X-rays. MedRAX and BiomedCLIP performed best, but no model was accurate enough for clinical use.

## Contribution

The study provides an independent comparative evaluation of multimodal LLMs, vision-language models, and an agent-based framework for pulmonary nodule detection on chest radiographs.

## Key findings

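- MedRAX and BiomedCLIP achieved the highest accuracy for pulmonary nodule detection: 0.711 (95% CI 0.613–0.808).
- Claude 3.7 Sonnet was the best proprietary single-model LLM (accuracy 0.651, 95% CI 0.548–0.753), but no statistically significant difference was observed between proprietary and open-source models.
- Overall model accuracy was insufficient for clinical use, and the results are exploratory given the small single-centre dataset, the differing prompting strategies, and the use of PNG images instead of DICOM.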

## Abstract

Artificial intelligence technologies are being actively introduced into clinical practice. The most promising solutions are AI assistants based on large language models (LLMs). Determining the feasibility of integrating such applications into clinical practice requires independent performance assessments. This study assessed the accuracy of several multimodal LLMs in detecting pulmonary nodules on chest radiographs (CXR).

This study included nine models: Llama 3.2 Vision 90B, Claude 3.5 Sonnet, Claude 3.7 Sonnet, Gemini 2.0 Pro Experimental, Perplexity, CXR-LLaVA, XrayGPT, BiomedCLIP, and MedRAX. Each model determined the presence or absence of pulmonary nodules in a dataset of 100 CXRs, 50 of which contained pulmonary nodules. ROC curves were constructed and diagnostic accuracy metrics were calculated. McNemar's test was used for pairwise accuracy comparisons.
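
The evaluation described above can be illustrated with a minimal sketch, assuming binary labels (1 = nodule present) and per-case predictions from two models on the same radiographs. The function names `diagnostic_metrics` and `mcnemar_exact` and the toy data are hypothetical. The exact (binomial) form of McNemar's test is used here as an assumption, since this summary does not say which variant the authors applied.

```python
from math import comb


def diagnostic_metrics(y_true, y_pred):
    """Sensitivity, specificity and accuracy for binary labels (1 = nodule).

    Assumes both classes occur in y_true, otherwise a division by zero occurs.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "accuracy": (tp + tn) / len(y_true),
    }


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar p-value for two models scored on the same cases."""
    # Only discordant pairs matter: cases where exactly one model is correct.
    b = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a == t and m != t)
    c = sum(1 for t, a, m in zip(y_true, pred_a, pred_b) if a != t and m == t)
    n = b + c
    if n == 0:
        return 1.0  # the models never disagree in correctness
    # Under H0 each discordant pair favours either model with probability 0.5.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data (hypothetical, not from the study): 4 nodule and 4 normal cases.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
model_a = [1, 1, 1, 0, 0, 0, 0, 1]
model_b = [1, 0, 0, 0, 0, 0, 1, 1]

print(diagnostic_metrics(y_true, model_a))
print(mcnemar_exact(y_true, model_a, model_b))
```

Because McNemar's test uses only the cases on which the two models disagree, it suits paired designs like this one, where every model reads the same set of images.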

The best results were achieved by the MedRAX framework and the BiomedCLIP vision-language model, each with an accuracy of 0.711 (95% CI 0.613–0.808). Among proprietary single-model LLMs, Claude 3.7 Sonnet performed best, with an accuracy of 0.651 (0.548–0.753). Llama 3.2 Vision 90B, Claude 3.5 Sonnet, and Gemini 2.0 Pro Experimental had identical accuracy: 0.602 (0.497–0.708).
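
To show how an accuracy value and its 95% confidence interval relate, here is a minimal sketch using the normal-approximation (Wald) interval. The helper `wald_ci` is hypothetical, and the counts (59 correct out of 83 scored cases) are an assumption chosen only because they reproduce the top reported figures under this method. This summary states neither the number of scored cases per model nor the interval method the authors used.

```python
from math import sqrt

Z_95 = 1.959964  # standard normal quantile for a two-sided 95% interval


def wald_ci(correct, total, z=Z_95):
    """Return (accuracy, lower, upper) using the Wald interval for a proportion."""
    p = correct / total
    half_width = z * sqrt(p * (1 - p) / total)
    # Clip to [0, 1] because the Wald interval can overshoot near the boundaries.
    return p, max(0.0, p - half_width), min(1.0, p + half_width)


# Assumed counts, not taken from the paper.
acc, lo, hi = wald_ci(59, 83)
print(f"{acc:.3f} ({lo:.3f}-{hi:.3f})")  # prints 0.711 (0.613-0.808)
```

The interval spans roughly 0.2 at this sample size, which is one reason the overlapping intervals of the models above do not support ranking them with confidence.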

The MedRAX framework and the BiomedCLIP vision-language model showed the highest accuracy. No statistically significant difference was observed between proprietary and open-source models, which may indicate potential for improving accuracy through refinement of open-source LLM-based models. Overall, the accuracy of the evaluated models was insufficient for implementation in current clinical practice. These results should be seen as exploratory given the small dataset size, the single-centre design, the different prompting strategies used for foundation and domain-adapted models, and the use of PNG images instead of DICOM.

## Full-text entities

- **Diseases:** pulmonary nodule (MESH:D055613)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12887855/full.md

## Figures

3 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12887855/full.md

## References

33 references — full list in the complete paper: https://tomesphere.com/paper/PMC12887855/full.md

---
Source: https://tomesphere.com/paper/PMC12887855