# An Interpretable Ensemble Transformer Framework for Breast Cancer Detection in Ultrasound Images

**Authors:** Riyadh M. Al-Tam, Aymen M. Al-Hejri, Fatma A. Hashim, Sachin M. Narangale, Mugahed A. Al-Antari, Sarah A. Alzakari

PMC · DOI: 10.3390/diagnostics16040622 · Diagnostics · 2026-02-20

## TL;DR

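This paper introduces a computer-aided diagnosis system that fuses two Vision Transformer models (DeiT and ViT) to detect breast cancer in ultrasound images. It reaches 96.92% accuracy on binary classification and uses Grad-CAM heatmaps to give visual explanations for its decisions.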

## Contribution

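The contribution is an interpretable ensemble framework that fuses DeiT and ViT features by concatenation for breast cancer detection in ultrasound images, pairs its predictions with Grad-CAM explanations, and is validated on three independent external datasets.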

## Key findings

- The ensemble model achieved 96.92% accuracy and 97.10% AUC for binary classification of breast cancer, and 94.27% accuracy with 94.81% AUC for three-class classification.
- It generalized to three independent datasets, with external-validation accuracy of 87.76% on BrEaST, 86.77% on BUS-BRA, and 86.99% on BUSI_WHU.
- Performance dropped for fine-grained BI-RADS classification, to 76.68% accuracy on BUS-BRA and 68.75% on BrEaST, highlighting the difficulty of clinical subclassification.
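
For readers less familiar with the two metrics quoted above, the following is a minimal sketch of how accuracy and AUC can be computed for a binary task. The labels and scores are made-up toy values, not data from the paper, and the function names `accuracy` and `roc_auc` are illustrative. The AUC here uses the rank-based (Mann-Whitney) formulation, which is one standard way to compute it; the paper's own evaluation code is not shown in this summary.

```python
def accuracy(y_true, y_pred):
    """Fraction of predicted labels that match the true labels."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as half a win.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 1 = malignant, 0 = benign (hypothetical values).
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.2, 0.6, 0.4, 0.5, 0.1]
preds = [int(s >= 0.5) for s in scores]  # threshold of 0.5 is an assumption

print("accuracy:", accuracy(labels, preds))
print("AUC:", roc_auc(labels, scores))
```

Accuracy depends on the chosen decision threshold, while AUC summarizes ranking quality across all thresholds, which is why the two numbers can differ for the same model.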

## Abstract

Background/Objectives: Early and accurate detection of breast cancer is essential for reducing mortality and improving patient outcomes. However, the manual interpretation of breast ultrasound images is challenging due to image variability, noise, and inter-observer subjectivity. This study aims to address these limitations by developing an automated and interpretable computer-aided diagnosis (CAD) system.

Methods: The proposed CAD system integrates ensemble transfer learning with Vision Transformer architectures. It combines the Data-Efficient Image Transformer (DeiT) and Vision Transformer (ViT) through concatenation-based feature fusion to exploit their complementary representations. Preprocessing, normalization, and targeted data augmentation enhance robustness, while Gradient-weighted Class Activation Mapping (Grad-CAM) provides visual explanations to support clinical interpretability. The proposed model is benchmarked against state-of-the-art CNNs (VGG16, ResNet50, DenseNet201) and Transformer models (ViT, DeiT, Swin, BEiT) using the Breast Ultrasound Images (BUSI) dataset.

Results: The ensemble achieved 96.92% accuracy and 97.10% AUC for binary classification, and 94.27% accuracy with 94.81% AUC for three-class classification. External validation on independent datasets demonstrated strong generalizability, with 87.76%/88.07% accuracy/AUC on BrEaST, 86.77%/85.90% on BUS-BRA, and 86.99%/86.99% on BUSI_WHU. Performance decreased for fine-grained BI-RADS classification (76.68%/84.59% accuracy/AUC on BUS-BRA and 68.75%/81.10% on BrEaST), reflecting the inherent complexity and subjectivity of clinical subclassification.

Conclusions: The proposed Vision Transformer-based ensemble demonstrates high diagnostic accuracy, strong cross-dataset generalization, and clinically meaningful explainability. These findings highlight its potential as a reliable second-opinion CAD tool for breast cancer diagnosis, particularly in resource-limited clinical environments.
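
The concatenation-based feature fusion named in the Methods can be illustrated with a small, dependency-free sketch. Everything specific here is an assumption for illustration only: the 768-dimensional embedding width, the random stand-in feature vectors, the single linear classification head, and the helper names `concat_fusion`, `linear_head`, and `softmax`. The abstract does not state the authors' layer sizes or head design, so this shows the general technique rather than their implementation.

```python
import math
import random


def concat_fusion(feat_a, feat_b):
    """Concatenation-based fusion: join two feature vectors end to end."""
    return list(feat_a) + list(feat_b)


def linear_head(features, weights, biases):
    """One fully connected layer: logit_k = w_k . x + b_k."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


rng = random.Random(0)
DIM = 768        # assumed embedding width per backbone (not from the paper)
NUM_CLASSES = 2  # binary task: benign vs. malignant

# Random stand-ins for the embeddings a DeiT and a ViT backbone would produce.
deit_feat = [rng.gauss(0.0, 1.0) for _ in range(DIM)]
vit_feat = [rng.gauss(0.0, 1.0) for _ in range(DIM)]

# Fuse, then classify with a hypothetical, untrained linear head.
fused = concat_fusion(deit_feat, vit_feat)
weights = [[rng.gauss(0.0, 0.01) for _ in range(len(fused))]
           for _ in range(NUM_CLASSES)]
biases = [0.0] * NUM_CLASSES
probs = softmax(linear_head(fused, weights, biases))

print("fused length:", len(fused))
print("class probabilities:", probs)
```

Concatenation keeps every feature from both backbones and lets the classifier learn how to weight them, at the cost of doubling the input width of the head compared with averaging or summing the two embeddings.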

## Linked entities

- **Diseases:** breast cancer (MONDO:0004989)

## Full-text entities

- **Genes:** VIT (vitrin) [NCBI Gene 5212] {aka VIT1}
- **Diseases:** lung cancer (MESH:D008175), cancer (MESH:D009369), injury to (MESH:D014947), obesity (MESH:D009765), BUS (MESH:D061325), deaths (MESH:D003643), cysts (MESH:D003560), benign lesion (MESH:D001932), CAD (MESH:C000719218), Breast Cancer (MESH:D001943)
- **Chemicals:** alcohol (MESH:D000438)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12939200/full.md

## Figures

12 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12939200/full.md

## References

81 references — full list in the complete paper: https://tomesphere.com/paper/PMC12939200/full.md

---
Source: https://tomesphere.com/paper/PMC12939200