# AACNN-ViT: Adaptive Attention-Augmented Convolutional and Vision Transformer Fusion for Lung Cancer Detection

**Authors:** Mohammad Ishtiaque Rahman, Amrina Rahman

PMC · DOI: 10.3390/jimaging12020062 · Journal of Imaging · 2026-01-30

## TL;DR

This paper introduces AACNN-ViT, a hybrid model that classifies lung CT scans as benign, malignant, or normal by fusing CNN and Vision Transformer features through adaptive attention, and that trains with a combined focal and cross-entropy loss to counter class imbalance.

## Contribution

The AACNN-ViT framework fuses local CNN features with global ViT embeddings through an adaptive attention module that learns how to weight the two representations, improving multiclass classification of lung lesions.
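To make the idea of adaptive fusion concrete, here is a minimal sketch in plain Python. It is not the paper's fusion module, whose exact design is not given in this summary. It assumes the simplest possible variant: each branch gets one scalar relevance score, a softmax turns the two scores into weights, and the fused vector is the weighted sum. The names `adaptive_fusion`, `w_cnn`, and `w_vit` are hypothetical, and the weight vectors stand in for parameters that would be learned during training.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a short list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def adaptive_fusion(cnn_feat, vit_feat, w_cnn, w_vit):
    """Fuse two same-length feature vectors with input-dependent weights.

    Assumption: one scalar score per branch (a dot product with a
    hypothetical learned vector), normalized by softmax.
    """
    if len(cnn_feat) != len(vit_feat):
        raise ValueError("both branches must be projected to the same size")
    scores = [dot(w_cnn, cnn_feat), dot(w_vit, vit_feat)]
    alpha_cnn, alpha_vit = softmax(scores)
    fused = [alpha_cnn * c + alpha_vit * v for c, v in zip(cnn_feat, vit_feat)]
    return fused, (alpha_cnn, alpha_vit)


# Toy inputs; real features would come from the CNN and ViT encoders.
cnn_feat = [0.2, 1.0, -0.5]
vit_feat = [0.7, -0.1, 0.3]
w_cnn = [0.5, 0.5, 0.5]
w_vit = [1.0, 1.0, 1.0]

fused, weights = adaptive_fusion(cnn_feat, vit_feat, w_cnn, w_vit)
print("branch weights (cnn, vit):", weights)
print("fused features:", fused)
```

Unlike plain concatenation, the branch weights here change with every input, so a sample for which one branch is more informative can lean on that branch. An attention-based module would typically compute such weights per feature or per token rather than per branch.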

## Key findings

- AACNN-ViT achieved 96.97% accuracy on the IQ-OTH/NCCD validation set, with a recall of 0.8333 on the minority Benign class.
- It outperformed the CNN-ViT concatenation baseline on both macro-F1 (0.9458 versus 0.7680) and accuracy (96.97% versus 89.09%).
- One-vs.-rest ROC analysis showed strong class separability with a micro-average AUC of 0.992.
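The gap between accuracy and macro-F1 in these findings reflects how each metric treats a small class. The sketch below computes accuracy, macro-F1, and weighted F1 from a confusion matrix. The matrix is invented for illustration and is not the paper's data, and the function names are hypothetical. Macro-F1 is taken as the unweighted mean of per-class F1 scores, which is the common convention.

```python
def per_class_metrics(cm):
    """Return (precision, recall, f1, support) for each class.

    cm[i][j] counts samples whose true class is i and predicted class is j.
    """
    n = len(cm)
    stats = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp
        fp = sum(cm[r][k] for r in range(n)) - tp
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        stats.append((precision, recall, f1, tp + fn))
    return stats


def summarize(cm):
    stats = per_class_metrics(cm)
    total = sum(s[3] for s in stats)
    correct = sum(cm[k][k] for k in range(len(cm)))
    return {
        "accuracy": correct / total,
        # Every class counts equally, however few samples it has.
        "macro_f1": sum(s[2] for s in stats) / len(stats),
        # Each class counts in proportion to its support.
        "weighted_f1": sum(s[2] * s[3] for s in stats) / total,
    }


# Illustrative matrix only (rows: true benign, malignant, normal).
toy_cm = [
    [6, 0, 4],
    [0, 50, 0],
    [1, 0, 39],
]
report = summarize(toy_cm)
print(report)
```

In this toy case the small benign class is often misclassified, which pulls macro-F1 well below both accuracy and weighted F1. That is why macro-F1 is the more revealing number when comparing models on an imbalanced dataset.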

## Abstract

Lung cancer remains a leading cause of cancer-related mortality. Although reliable multiclass classification of lung lesions from CT imaging is essential for early diagnosis, it remains challenging due to subtle inter-class differences, limited sample sizes, and class imbalance. We propose an Adaptive Attention-Augmented Convolutional Neural Network with Vision Transformer (AACNN-ViT), a hybrid framework that integrates local convolutional representations with global transformer embeddings through an adaptive attention-based fusion module. The CNN branch captures fine-grained spatial patterns, the ViT branch encodes long-range contextual dependencies, and the adaptive fusion mechanism learns to weight cross-representation interactions to improve discriminability. To reduce the impact of imbalance, a hybrid objective that combines focal loss with categorical cross-entropy is incorporated during training. Experiments on the IQ-OTH/NCCD dataset (benign, malignant, and normal) show consistent performance progression in an ablation-style evaluation: CNN-only, ViT-only, CNN-ViT concatenation, and AACNN-ViT. The proposed AACNN-ViT achieved 96.97% accuracy on the validation set with macro-averaged precision/recall/F1 of 0.9588/0.9352/0.9458 and weighted F1 of 0.9693, substantially improving minority-class recognition (Benign recall 0.8333) compared with CNN-ViT (accuracy 89.09%, macro-F1 0.7680). One-vs.-rest ROC analysis further indicates strong separability across all classes (micro-average AUC 0.992). These results suggest that adaptive attention-based fusion offers a robust and clinically relevant approach for computer-aided lung cancer screening and decision support.
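The hybrid objective mentioned in the abstract can be sketched as a weighted sum of focal loss and categorical cross-entropy for a single sample. This is a minimal illustration under stated assumptions: the mixing weight `lam`, the focusing parameter `gamma = 2.0`, and the function names are hypothetical, since this summary does not give the paper's exact formulation or hyperparameters.

```python
import math

EPS = 1e-12  # guards against log(0)


def cross_entropy(probs, target):
    """Categorical cross-entropy for one sample with an integer label."""
    return -math.log(max(probs[target], EPS))


def focal_loss(probs, target, gamma=2.0):
    """Focal loss: cross-entropy scaled by (1 - p_t) ** gamma.

    Confident, correct predictions (p_t near 1) contribute little, so
    training focuses on hard and often minority-class samples.
    """
    p_t = max(probs[target], EPS)
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def hybrid_loss(probs, target, gamma=2.0, lam=0.5):
    """Assumed form: lam * focal + (1 - lam) * cross-entropy."""
    return lam * focal_loss(probs, target, gamma) + (1.0 - lam) * cross_entropy(probs, target)


# Softmax outputs over (benign, malignant, normal); values are illustrative.
easy = [0.05, 0.90, 0.05]  # confident, correct malignant prediction
hard = [0.30, 0.10, 0.60]  # benign sample mostly mistaken for normal

for name, probs, target in [("easy", easy, 1), ("hard", hard, 0)]:
    print(name,
          "ce:", cross_entropy(probs, target),
          "focal:", focal_loss(probs, target),
          "hybrid:", hybrid_loss(probs, target))
```

With `gamma = 2.0`, the focal term scales the easy sample's cross-entropy by (1 - 0.9)^2 = 0.01 but the hard sample's by (1 - 0.3)^2 = 0.49. Keeping a cross-entropy term alongside it preserves a useful gradient signal from well-classified samples.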

## Linked entities

- **Diseases:** lung cancer (MONDO:0005138)

## Full-text entities

- **Diseases:** Lung Cancer (MESH:D008175), lung abnormalities (MESH:D008171), Cancer Diseases (MESH:D009369)
- **Species:** Homo sapiens (human, species) [taxon 9606]

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/PMC12941408/full.md

## Figures

4 figures with captions in the complete paper: https://tomesphere.com/paper/PMC12941408/full.md

## References

31 references — full list in the complete paper: https://tomesphere.com/paper/PMC12941408/full.md

---
Source: https://tomesphere.com/paper/PMC12941408