An Attention Infused Deep Learning System with Grad-CAM Visualization for Early Screening of Glaucoma

Ramanathan Swaminathan

arXiv:2505.17808·cs.CV·January 15, 2026

TL;DR

This paper presents a novel deep learning system combining CNN and Vision Transformer with Cross-Attention for early glaucoma detection, demonstrating improved accuracy on ACRIMA and Drishti datasets.

Contribution

It introduces a fused CNN-ViT model with Cross-Attention, enhancing interpretability and performance in glaucoma screening.

Findings

01. Improved detection accuracy over baseline models

02. Effective visualization of clinically relevant regions

03. Successful integration of CNN and ViT for medical imaging

Abstract

This work combines a deep custom convolutional neural network (CNN) with a Vision Transformer (ViT), fusing the two through a Cross-Attention module. Two fundus image datasets used for glaucoma detection, ACRIMA and Drishti, are utilized. The Cross-Attention mechanism helps the model learn clinically relevant regions of the fundus through bidirectional feature exchange between the CNN and ViT streams. Experiments show improved performance compared with standalone baseline CNN and ViT models.
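The bidirectional feature exchange described in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: the abstract does not give the architecture details, so the token counts (49 CNN tokens, 196 ViT tokens), the shared width `d = 16`, the single attention head, the residual connections, the mean-pool fusion, and the names `cross_attend` and `bidirectional_fusion` are all assumptions made for illustration, and the weights are random rather than trained.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(queries, context, w_q, w_k, w_v):
    """Single-head cross-attention: `queries` tokens attend over `context` tokens."""
    q = queries @ w_q                      # (n_q, d)
    k = context @ w_k                      # (n_c, d)
    v = context @ w_v                      # (n_c, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ v, weights

def bidirectional_fusion(cnn_tokens, vit_tokens, params):
    """Hypothetical fusion: each stream queries the other, then both are pooled."""
    # CNN tokens query the ViT stream.
    cnn_from_vit, attn_cnn_to_vit = cross_attend(
        cnn_tokens, vit_tokens, *params["cnn_to_vit"])
    # ViT tokens query the CNN stream.
    vit_from_cnn, attn_vit_to_cnn = cross_attend(
        vit_tokens, cnn_tokens, *params["vit_to_cnn"])
    # Residual connections (an assumption) keep each stream's own features.
    cnn_out = cnn_tokens + cnn_from_vit
    vit_out = vit_tokens + vit_from_cnn
    # Mean-pool each stream and concatenate into one feature vector.
    fused = np.concatenate([cnn_out.mean(axis=0), vit_out.mean(axis=0)])
    return fused, attn_cnn_to_vit, attn_vit_to_cnn

rng = np.random.default_rng(0)
d = 16                                     # assumed shared embedding width
cnn_tokens = rng.normal(size=(49, d))      # e.g. a flattened 7x7 feature map
vit_tokens = rng.normal(size=(196, d))     # e.g. 14x14 patch embeddings
params = {
    name: tuple(rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))
    for name in ("cnn_to_vit", "vit_to_cnn")
}
fused, attn_cnn_to_vit, attn_vit_to_cnn = bidirectional_fusion(
    cnn_tokens, vit_tokens, params)
print(fused.shape, attn_cnn_to_vit.shape, attn_vit_to_cnn.shape)
```

In a trained model, the `fused` vector would feed a classification head, and the two attention maps show which tokens in one stream each token in the other stream weights most heavily.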
