# Comparative Analysis of Transformer Architectures and Ensemble Methods for Automated Glaucoma Screening in Fundus Images from Portable Ophthalmoscopes

**Authors:** Rodrigo Otávio Cantanhede Costa, Pedro Alexandre Ferreira França, Alexandre César Pinto Pessoa, Geraldo Braz Júnior, João Dallyson Sousa de Almeida, António Cunha

PMC · DOI: 10.3390/vision9040093 · Vision · 2025-11-03

## TL;DR

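This paper evaluates four Transformer models (Swin-Tiny, ViT-Base, MobileViT-Small, and DeiT-Base), individually and in ensembles, for glaucoma detection in noisy, low-resolution fundus images from smartphone-coupled ophthalmoscopes. Ensembles combined with patient-level aggregation improve accuracy and sensitivity, reaching up to 85% accuracy and an 84.2% F1-score on the D-Eye dataset.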

## Contribution

The paper demonstrates that ensembles of Transformer models, evaluated at both the image and patient levels, improve glaucoma detection accuracy and sensitivity on low-quality images from portable ophthalmoscopes.

## Key findings

- Ensemble methods and patient-level aggregation significantly improve accuracy and sensitivity for glaucoma detection.
- The best results reached up to 85% accuracy and an 84.2% F1-score on the D-Eye dataset, with a notable reduction in false negatives.
- Grad-CAM attention maps show Transformers focus on anatomically relevant regions for diagnosis.
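
To make the idea of patient-level aggregation concrete, here is a minimal sketch. It assumes that each image already has a predicted glaucoma probability and that a patient is flagged when the mean probability across their images reaches a threshold. The function name `aggregate_by_patient`, the mean rule, the 0.5 threshold, and the sample values are illustrative assumptions; this summary does not state which aggregation rule the paper uses.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_patient(image_probs, patient_ids, threshold=0.5):
    """Return one boolean decision per patient from per-image probabilities.

    Assumption: the patient-level score is the mean of that patient's
    image-level glaucoma probabilities (other rules, such as majority
    voting or taking the maximum, are equally plausible).
    """
    grouped = defaultdict(list)
    for pid, prob in zip(patient_ids, image_probs):
        grouped[pid].append(prob)
    # A patient is flagged when the averaged probability reaches the threshold.
    return {pid: mean(probs) >= threshold for pid, probs in grouped.items()}


# Hypothetical image-level outputs for two patients.
probs = [0.9, 0.4, 0.7, 0.2, 0.3]
pids = ["p1", "p1", "p1", "p2", "p2"]
patient_pred = aggregate_by_patient(probs, pids)
print(patient_pred)
```

In this toy input, one of patient `p1`'s images falls below the threshold on its own, but the averaged score still flags the patient, which illustrates how aggregation can offset a single noisy image.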

## Abstract

Deep learning for glaucoma screening often relies on high-resolution clinical images and convolutional neural networks (CNNs). However, these methods face significant performance drops when applied to noisy, low-resolution images from portable devices. To address this, our work investigates ensemble methods using multiple Transformer architectures for automated glaucoma detection in challenging scenarios. We use the Brazil Glaucoma (BrG) and private D-Eye datasets to assess model robustness. These datasets include images typical of smartphone-coupled ophthalmoscopes, which are often noisy and variable in quality. Four Transformer models—Swin-Tiny, ViT-Base, MobileViT-Small, and DeiT-Base—were trained and evaluated both individually and in ensembles. We evaluated the results at both image and patient levels to reflect clinical practice. The results show that, although performance drops on lower-quality images, ensemble combinations and patient-level aggregation significantly improve accuracy and sensitivity. We achieved up to 85% accuracy and an 84.2% F1-score on the D-Eye dataset, with a notable reduction in false negatives. Grad-CAM attention maps confirmed that Transformers identify anatomical regions relevant to diagnosis. These findings reinforce the potential of Transformer ensembles as an accessible solution for early glaucoma detection in populations with limited access to specialized equipment.
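
The ensemble evaluation described in the abstract can be illustrated with a small sketch. It assumes soft voting (averaging the per-image probabilities of the four models) followed by a 0.5 threshold, then computes the F1-score and the number of false negatives. The combination rule, the function names `soft_vote` and `f1_and_false_negatives`, and all numbers are hypothetical; they are not taken from the paper.

```python
def soft_vote(model_probs, threshold=0.5):
    """Average per-image probabilities across models, then threshold.

    model_probs maps a model name to a list of glaucoma probabilities,
    one per image, in the same image order for every model.
    """
    averaged = [sum(ps) / len(ps) for ps in zip(*model_probs.values())]
    return [int(p >= threshold) for p in averaged]


def f1_and_false_negatives(y_true, y_pred):
    """Return (F1-score, false-negative count) for binary labels (1 = glaucoma)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    f1 = 2 * tp / denom if denom else 0.0
    return f1, fn


# Hypothetical outputs of the four architectures on four images.
model_probs = {
    "swin_tiny": [0.8, 0.3, 0.6, 0.2],
    "vit_base": [0.7, 0.6, 0.5, 0.1],
    "mobilevit_small": [0.9, 0.4, 0.7, 0.3],
    "deit_base": [0.6, 0.5, 0.4, 0.4],
}
y_true = [1, 1, 1, 0]

preds = soft_vote(model_probs)
f1, fn = f1_and_false_negatives(y_true, preds)
print(preds, round(f1, 3), fn)
```

Tracking false negatives alongside F1 matters in screening, because a missed glaucoma case is typically costlier than a false alarm that is later ruled out on referral.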

## Linked entities

- **Diseases:** glaucoma (MONDO:0005041)

## Full-text entities

- **Diseases:** glaucoma (MESH:D005901), matched on "BrG", the abbreviation of the Brazil Glaucoma dataset
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12641837/full.md

## Figures

7 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12641837/full.md

## References

43 references — full list in the complete paper: https://tomesphere.com/paper/PMC12641837/full.md

---
Source: https://tomesphere.com/paper/PMC12641837