# Evaluation of large language models for diagnostic impression generation from brain MRI report findings: a multicenter benchmark and reader study

**Authors:** Ming-Liang Wang, Rui-Peng Zhang, Wen-Juan Wu, Yu Lu, Xiao-Er Wei, Zheng Sun, Bao-Hui Guan, Jun-Jie Zhang, Xue Wu, Lei Zhang, Tian-Le Wang, Yue-Hua Li

PMC · DOI: 10.1038/s41746-026-02380-4 · NPJ Digital Medicine · 2026-01-22

## TL;DR

This study evaluates large language models for generating brain MRI diagnostic impressions, finding that DeepSeek-R1 performs best and improves radiologists' accuracy and efficiency.

## Contribution

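The study introduces a multicenter benchmark (4293 brain MRI reports from three medical centers, 15 brain disease categories) and a six-radiologist reader study to assess 10 LLMs for diagnostic impression generation from report findings.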

## Key findings

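- DeepSeek-R1 achieved the highest performance among the 10 evaluated LLMs on the full dataset and across clinical scenarios and subgroups, particularly when given structured report findings and clinical information.
- A top three differential-diagnosis prompting strategy achieved 97.6% patient-level accuracy, versus 87.1% for single-diagnosis prompting.
- Six radiologists reading 500 reports with DeepSeek-R1 assistance improved diagnostic accuracy (AUPRC from 0.774 to 0.893) and reduced reading time (from 61 to 53 s), with larger benefits for junior radiologists.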

## Abstract

Automatically deriving radiological diagnoses from brain MRI report findings is challenging because the task is highly complex and demands domain expertise. This study evaluated 10 large language models (LLMs) in generating diagnoses from brain MRI report findings, using 4293 reports (9973 diagnostic labels) covering 15 brain disease categories from three medical centers. DeepSeek-R1 achieved the highest performance among the evaluated models on the full dataset and across different clinical scenarios and subgroups, particularly when provided with structured report findings and clinical information. A top three differential-diagnosis prompting strategy achieved superior performance, with 97.6% patient-level accuracy versus 87.1% for single-diagnosis prompting. The diagnostic performance of six radiologists was assessed with and without DeepSeek-R1 assistance on 500 reports. Integration of DeepSeek-R1 significantly improved diagnostic accuracy (AUPRC: from 0.774 to 0.893) and reduced reading time (from 61 to 53 s), with more pronounced benefits for junior radiologists. Our findings indicate that effective automated diagnostic impression generation in brain MRI reporting requires advanced large-scale LLMs like DeepSeek-R1. With optimized prompting and input strategies, this framework may serve as a supportive tool in drafting brain MRI reports and contribute to enhanced workflow efficiency in radiology practice.
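
To illustrate why top three prompting can score higher than single-diagnosis prompting, the following is a minimal sketch of a top-k hit rate. It assumes a report counts as correct when any of the model's first k candidate diagnoses appears among that report's reference labels; this summary does not give the paper's exact definition of patient-level accuracy, so that scoring rule, the function name `top_k_patient_accuracy`, and the toy data are all hypothetical.

```python
from typing import Sequence, Set


def top_k_patient_accuracy(
    predictions: Sequence[Sequence[str]],
    references: Sequence[Set[str]],
    k: int,
) -> float:
    """Fraction of reports where any of the first k candidates is a reference label.

    ASSUMPTION: this "any match within top k" rule is illustrative only and
    may differ from the metric used in the paper.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        raise ValueError("at least one report is required")
    if k < 1:
        raise ValueError("k must be at least 1")

    hits = sum(
        1
        for ranked, truth in zip(predictions, references)
        if any(candidate in truth for candidate in ranked[:k])
    )
    return hits / len(predictions)


# Toy data (hypothetical, not taken from the study): ranked candidates per report.
toy_predictions = [
    ["glioma", "metastasis", "abscess"],
    ["infarction", "demyelination", "glioma"],
    ["meningioma", "schwannoma", "metastasis"],
    ["demyelination", "infarction", "encephalitis"],
]
toy_references = [
    {"metastasis"},
    {"infarction"},
    {"meningioma"},
    {"encephalitis"},
]

top1 = top_k_patient_accuracy(toy_predictions, toy_references, k=1)
top3 = top_k_patient_accuracy(toy_predictions, toy_references, k=3)
print(f"top-1: {top1:.2f}, top-3: {top3:.2f}")
```

Under this rule the top three score can never fall below the top one score on the same ranked outputs, because every top one hit is also a top three hit. That is consistent in direction with the reported 97.6% versus 87.1%, though the paper's two figures come from separate prompting strategies rather than one ranked list.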

## Full-text entities

- **Diseases:** brain disease (MESH:D001927)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12929788/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12929788/full.md

## References

2 references — full list in the complete paper: https://tomesphere.com/paper/PMC12929788/full.md

---
Source: https://tomesphere.com/paper/PMC12929788