# Automated Tumor and Node Staging from Esophageal Cancer Endoscopic Ultrasound Reports: A Benchmark of Advanced Reasoning Models with Prompt Engineering and Cross-Lingual Evaluation

**Authors:** Xudong Hu, Lingde Feng, Bingzhong Jing, Linna Luo, Wencheng Tan, Yin Li, Xinyi Zheng, Xinxin Huang, Shiyong Lin, Huiling Wu, Longjun He

PMC · DOI: 10.3390/diagnostics16020215 · Diagnostics · 2026-01-09

## TL;DR

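This study compares four large language models (DeepSeek-R1, GPT-4o, Qwen3, and Grok-3) on automatically extracting T/N staging from esophageal cancer endoscopic ultrasound (EUS) reports, finding that DeepSeek-R1 is the most accurate across both languages (Chinese/English) and both prompting conditions (with/without a designed prompt).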

## Contribution

The study introduces a benchmark for AI models in automated T/N staging from EUS reports, highlighting DeepSeek-R1's superior performance and robustness.

## Key findings

- DeepSeek-R1 outperformed GPT-4o, Qwen3, and Grok-3 in both T- and N-staging tasks.
- N-staging was more challenging than T-staging for all models, with lower overall accuracy.
- DeepSeek-R1 showed its strongest advantage in the Chinese without-prompt T-staging and English without-prompt N-staging scenarios.

## Abstract

**Objectives:** To benchmark the performance of DeepSeek-R1 against three other advanced AI reasoning models (GPT-4o, Qwen3, Grok-3) in automatically extracting T/N staging from complex esophageal cancer endoscopic ultrasound (EUS) reports, and to evaluate the impact of language (Chinese/English) and prompting strategy (with/without a designed prompt) on model accuracy and robustness.

**Methods:** We retrospectively analyzed 625 EUS reports for T-staging and 579 for N-staging, collected from 663 patients at the Sun Yat-sen University Cancer Center between 2018 and 2020. A 2 × 2 factorial design (Language × Prompt) was employed under a zero-shot setting. Model performance was evaluated using accuracy, and the odds ratio (OR) was calculated to quantify the comparative performance advantage between models across different scenarios.

**Results:** Performance was evaluated across four scenarios: (1) Chinese with-prompt, (2) Chinese without-prompt, (3) English with-prompt, and (4) English without-prompt. In both the T- and N-staging tasks, DeepSeek-R1 demonstrated superior overall performance compared to its competitors. For T-staging, the average accuracy of DeepSeek-R1, GPT-4o, Qwen3, and Grok-3 was 91.4%, 84.2%, 89.5%, and 81.3%, respectively. For N-staging, the corresponding average accuracy was 84.2%, 65.0%, 68.4%, and 51.9%. Notably, N-staging proved more challenging than T-staging for all models, as indicated by lower accuracy. DeepSeek-R1's superiority was most pronounced in the Chinese without-prompt T-staging scenario, where it achieved significantly higher accuracy than GPT-4o (OR = 7.84, 95% CI [4.62–13.30], p < 0.001), Qwen3 (OR = 5.00, 95% CI [2.85–8.79], p < 0.001), and Grok-3 (OR = 6.47, 95% CI [4.30–9.74], p < 0.001).

**Conclusions:** This study validates the feasibility and effectiveness of large language models (LLMs) for automated T/N staging from EUS reports. Our findings confirm that DeepSeek-R1 possesses strong intrinsic reasoning capabilities, achieving the most robust performance across diverse conditions, with the most pronounced advantage observed in the challenging English without-prompt N-staging task. By establishing a standardized, objective benchmark, DeepSeek-R1 mitigates inter-observer variability, and its deployment provides a reliable foundation for guiding precise, individualized treatment planning for esophageal cancer patients.
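The abstract reports odds ratios with 95% confidence intervals but does not say how they were estimated. As a rough illustration only, the sketch below computes an odds ratio and a Wald interval from an unpaired 2 × 2 table of correct/incorrect counts for two models. The function name `odds_ratio_wald` and the counts are hypothetical and are not taken from the paper; the authors may have used a different procedure, such as a paired or regression-based analysis.

```python
import math


def odds_ratio_wald(correct_a, total_a, correct_b, total_b, z=1.96):
    """Odds ratio of model A being correct relative to model B.

    Returns (odds_ratio, ci_lower, ci_upper) using a Wald interval on
    the log-odds scale. Assumes the two samples are independent, which
    is a simplification when both models read the same reports.
    """
    wrong_a = total_a - correct_a
    wrong_b = total_b - correct_b
    if min(correct_a, wrong_a, correct_b, wrong_b) <= 0:
        raise ValueError("every cell must be positive for a Wald interval")

    # Odds of a correct answer for A divided by the odds for B.
    odds_ratio = (correct_a * wrong_b) / (wrong_a * correct_b)

    # Standard error of log(OR) for a 2 x 2 table.
    se = math.sqrt(1 / correct_a + 1 / wrong_a + 1 / correct_b + 1 / wrong_b)

    log_or = math.log(odds_ratio)
    return odds_ratio, math.exp(log_or - z * se), math.exp(log_or + z * se)


# Hypothetical counts, NOT data from the paper:
# model A is correct on 90 of 100 reports, model B on 75 of 100.
or_value, ci_low, ci_high = odds_ratio_wald(90, 100, 75, 100)
print(f"OR = {or_value:.2f}, 95% CI [{ci_low:.2f}, {ci_high:.2f}]")
```

An OR above 1 with a confidence interval that excludes 1 indicates that the first model's odds of a correct staging are higher than the second's, which is how the comparisons in the Results read.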

## Linked entities

- **Diseases:** esophageal cancer (MONDO:0007576)

## Full-text entities

- **Diseases:** Esophageal Cancer (MESH:D004938), Cancer (MESH:D009369)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12839693/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12839693/full.md

## References

47 references — full list in the complete paper: https://tomesphere.com/paper/PMC12839693/full.md

---
Source: https://tomesphere.com/paper/PMC12839693