Attend what matters: Leveraging vision foundational models for breast cancer classification using mammograms

Samyak Sanghvi; Piyush Miglani; Sarvesh Shashikumar; Kaustubh R Borgavi; Veenu Singla; Chetan Arora

arXiv:2604.19350·cs.CV·April 22, 2026

TL;DR

This paper introduces a framework that combines region-of-interest (RoI) token reduction, contrastive learning, and pretrained vision transformers to improve breast cancer classification from mammograms. It targets two difficulties: high-resolution images with small abnormalities, and fine-grained distinctions between classes.

Contribution

The paper proposes an approach that combines RoI-based token reduction, contrastive learning, and pretrained ViT models to improve mammogram classification accuracy.

Findings

01. Achieves superior performance over existing baselines on public datasets.

02. Demonstrates the effectiveness of RoI-guided attention and contrastive learning in fine-grained medical image classification.

03. Establishes potential clinical utility for large-scale breast cancer screening.

Abstract

Vision Transformers (ViT) have become the architecture of choice for many computer vision tasks, yet their performance in computer-aided diagnostics remains limited. Focusing on breast cancer detection from mammograms, we identify two main causes for this shortfall. First, medical images are high-resolution with small abnormalities, leading to an excessive number of tokens and making it difficult for the softmax-based attention to localize and attend to relevant regions. Second, medical image classification is inherently fine-grained, with low inter-class and high intra-class variability, where standard cross-entropy training is insufficient. To overcome these challenges, we propose a framework with three key components: (1) region of interest (RoI) based token reduction using an object detection model to guide attention; (2) contrastive learning between selected…
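
To make the token-count problem and the idea of RoI-based token reduction concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation. The image size, the 16-pixel patch size, the example box, and the function names `num_patch_tokens`, `roi_token_mask`, and `select_roi_tokens` are all assumptions chosen for illustration, and the rule used here (keep every patch that overlaps a detector box) is one plausible reading of "token reduction", not a detail confirmed by the abstract.

```python
import numpy as np


def num_patch_tokens(height, width, patch=16):
    """Number of non-overlapping patch tokens a ViT would see (no class token)."""
    return (height // patch) * (width // patch)


def roi_token_mask(height, width, boxes, patch=16):
    """Boolean mask over the flattened patch grid.

    A patch is marked True if it overlaps any box. Boxes are hypothetical
    detector outputs given as (x0, y0, x1, y1) in pixels, with x1 and y1
    exclusive.
    """
    grid_h, grid_w = height // patch, width // patch
    mask = np.zeros((grid_h, grid_w), dtype=bool)
    for x0, y0, x1, y1 in boxes:
        # First and last patch column/row touched by the box, clipped to the grid.
        c0 = max(int(x0) // patch, 0)
        r0 = max(int(y0) // patch, 0)
        c1 = min((int(np.ceil(x1)) - 1) // patch, grid_w - 1)
        r1 = min((int(np.ceil(y1)) - 1) // patch, grid_h - 1)
        if c1 >= c0 and r1 >= r0:  # skip boxes that fall outside the image
            mask[r0:r1 + 1, c0:c1 + 1] = True
    return mask.reshape(-1)


def select_roi_tokens(tokens, mask):
    """Keep only the token rows whose patches overlap a region of interest."""
    return tokens[mask]


if __name__ == "__main__":
    height, width = 2048, 1024            # assumed image size, for illustration
    total = num_patch_tokens(height, width)
    print("tokens before reduction:", total)          # 8192

    boxes = [(100, 200, 164, 264)]        # one hypothetical 64x64 detection
    mask = roi_token_mask(height, width, boxes)
    tokens = np.random.default_rng(0).normal(size=(total, 8))  # dummy embeddings
    kept = select_roi_tokens(tokens, mask)
    print("tokens after reduction:", kept.shape[0])   # 25
```

With these assumed numbers, a single small box shrinks the sequence from 8192 tokens to 25, which illustrates why restricting attention to detector-proposed regions could make small abnormalities easier to attend to. How the paper actually chooses, pads, or ranks the selected tokens is not stated in this excerpt.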

Code & Models

Repositories

https://aih-iitd.github.io/publications/attend-what-matters
