# Benchmark-Driven Clinical Decision Framework for Multi-Class Middle Ear Disease Diagnosis: Superiority of Swin Transformer in Accuracy and Stability

**Authors:** Guoping Chen, Haoyi Zhang, Junbo Zeng, Yuexin Cai, Dong Huang, Yubin Chen, Peng Li, Yiqing Zheng

PMC · DOI: 10.3390/diagnostics16030482 · Diagnostics · 2026-02-05

## TL;DR

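This paper benchmarks CNN and Transformer architectures for multi-class middle ear disease diagnosis from oto-endoscopic images, finds the Swin Transformer the most accurate and stable (95.53% accuracy; 95.61% ± 0.38% in cross-validation), and proposes a probability-guided Top-K clinical decision framework for practical deployment.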

## Contribution

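The study's primary contribution is a probability-guided Top-K clinical decision framework that balances high accuracy with complete case coverage. It is accompanied by a systematic benchmark of state-of-the-art CNNs and Transformers that establishes a performance baseline.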
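This summary does not spell out the decision rule behind the Top-K framework, so the following is only a minimal sketch of one plausible reading: a confident prediction yields a single label, and any other case yields a shortlist of the K most probable classes, so every case receives an output. The class names, the `confidence_threshold` value, and the helper names `top_k_decision`, `coverage`, and `hit_rate` are all assumptions made for illustration, not the authors' method.

```python
# Hypothetical placeholders: the summary does not name the five categories.
CLASS_NAMES = ["class_A", "class_B", "class_C", "class_D", "class_E"]


def top_k_decision(probs, class_names=CLASS_NAMES, k=2, confidence_threshold=0.90):
    """Return one label when the model is confident, else a Top-K shortlist.

    The threshold of 0.90 is an illustrative assumption, not a value from the paper.
    """
    if len(probs) != len(class_names):
        raise ValueError("probs and class_names must have the same length")
    # Rank classes from most to least probable.
    ranked = sorted(zip(class_names, probs), key=lambda pair: pair[1], reverse=True)
    if ranked[0][1] >= confidence_threshold:
        return [ranked[0][0]]
    return [name for name, _ in ranked[:k]]


def coverage(decisions):
    """Fraction of cases that received at least one candidate label."""
    return sum(1 for d in decisions if d) / len(decisions)


def hit_rate(decisions, true_labels):
    """Fraction of cases whose true label appears among the returned candidates."""
    return sum(t in d for d, t in zip(decisions, true_labels)) / len(decisions)


# Made-up softmax outputs for three cases (not data from the study).
cases = [
    [0.95, 0.02, 0.01, 0.01, 0.01],  # confident: single label
    [0.50, 0.40, 0.05, 0.03, 0.02],  # uncertain: Top-2 shortlist
    [0.30, 0.10, 0.45, 0.10, 0.05],  # uncertain: Top-2 shortlist
]
truth = ["class_A", "class_B", "class_D"]
decisions = [top_k_decision(p) for p in cases]
```

Under this reading, coverage is 100% by construction because no case is ever rejected; the trade-off moves into how often the clinician is handed two candidates instead of one.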

## Key findings

- Swin Transformer achieved 95.53% accuracy and 93.37% Macro-F1 score in diagnosing middle ear diseases.
- The Swin Transformer model was stable across 5-fold cross-validation, with an accuracy of 95.61% ± 0.38%.
- A probability-guided Top-2 decision framework achieved 93.25% accuracy with 100% case coverage.
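The findings above report both accuracy and Macro-F1. As a brief illustration of why the two can differ, the sketch below computes both from scratch on made-up labels. Macro-F1 averages the per-class F1 scores with equal weight, so minority classes count as much as the majority class. The toy labels and function names are assumptions for illustration and are unrelated to the study's data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Imbalanced toy example: class 0 dominates and is always predicted correctly.
y_true = [0, 0, 0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 0, 0, 1, 0, 2, 1]

acc = accuracy(y_true, y_pred)   # 6 of 8 correct
mf1 = macro_f1(y_true, y_pred)   # mean of 8/9, 1/2 and 2/3
```

Here accuracy is 0.75 while Macro-F1 is roughly 0.685, because the errors fall entirely on the two smaller classes.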

## Abstract

Background/Objectives: The variable accuracy of middle ear disease diagnosis based on oto-endoscopy underscores the need for improved decision support. Although convolutional neural networks (CNNs) are currently a mainstay of computer-aided diagnosis (CAD), their constraints in global feature integration persist. We therefore systematically benchmarked state-of-the-art CNNs and Transformers to establish a performance baseline. Beyond this benchmark, our primary contribution is the development of a probability-guided Top-K clinical decision framework that balances high accuracy with complete case coverage for practical deployment.

Methods: Using a multicenter dataset of 6361 images (five categories), we implemented a two-stage validation strategy (fixed-split followed by 5-fold cross-validation). A comprehensive comparison was performed among leading CNNs and Transformer variants assessed by accuracy and Macro-F1 score.

Results: The Swin Transformer model demonstrated superior performance, achieving an accuracy of 95.53% and a Macro-F1 score of 93.37%. It exhibited exceptional stability (95.61% ± 0.38% in cross-validation) and inherent robustness to class imbalance. A probability-guided Top-2 decision framework was developed, achieving 93.25% accuracy with 100% case coverage.

Conclusions: This rigorous benchmark established Swin Transformer as the most effective architecture. Consequently, this study delivers not only a performance benchmark but also a clinically actionable decision-support framework, thereby facilitating the deployment of AI-assisted diagnosis for chronic middle ear conditions in specialist otology.
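The cross-validation stage described in the Methods can be sketched as follows. The abstract does not say whether the folds were stratified by class or which standard deviation (sample or population) underlies the "± 0.38%" figure, so stratification, the sample standard deviation, and all names and numbers below are assumptions for illustration only.

```python
import random
import statistics
from collections import defaultdict


def stratified_folds(labels, k=5, seed=0):
    """Split sample indices into k folds that preserve class proportions.

    Stratification is an assumption; the summary only states "5-fold".
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal each class's indices round-robin across the folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    return folds


def summarize(fold_scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)


# Toy imbalanced label set with three classes (not the study's data).
labels = [0] * 50 + [1] * 30 + [2] * 20
folds = stratified_folds(labels, k=5)

# Made-up per-fold accuracies, used only to show the "mean ± std" report format.
example_scores = [0.950, 0.960, 0.955, 0.960, 0.950]
mean_score, std_score = summarize(example_scores)
report = f"{mean_score:.2%} ± {std_score:.2%}"
```

Each fold serves once as the held-out set while the remaining four are used for training; a small spread across the five scores is what a stability claim of this kind rests on.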

## Linked entities

- **Diseases:** middle ear disease (MONDO:0003276)

## Full-text entities

- **Diseases:** Middle Ear Disease (MESH:D010033)

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12897263/full.md

## Figures

9 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12897263/full.md

## References

43 references — full list in the complete paper: https://tomesphere.com/paper/PMC12897263/full.md

---
Source: https://tomesphere.com/paper/PMC12897263