Systematic Evaluation of Vision Transformers for Automated Cervical Cancer Classification: Optimization, Statistical Validation, and Clinical Interpretability
Nisreen Albzour, Sarah S. Lam

TL;DR
This study systematically optimized Vision Transformer models for cervical cancer screening on the Herlev dataset, achieving high accuracy and showing that model attention aligns with clinical features.
Contribution
It introduces an optimized ViT architecture tailored for cervical cancer classification, demonstrating improved accuracy and interpretability over existing CNN-based methods.
Findings
Achieved 94.9%-95.2% cross-validation accuracy.
Identified effective augmentation and class weighting strategies.
Grad-CAM confirmed model attention aligns with clinical features.
Abstract
Manual Pap smear analysis for cervical cancer screening is limited by inter-observer variability, time constraints, and restricted expert availability. Although convolutional neural networks (CNNs) have automated cervical cell classification, they remain limited in modeling long-range spatial dependencies and often lack clinical interpretability. In this study, Vision Transformer (ViT) architectures were systematically optimized for automated cervical cancer screening, with improved interpretability as a result. The Herlev dataset (917 images: 242 normal, 675 abnormal) was used to optimize ViT-Tiny, a lightweight Vision Transformer architecture designed for reduced computational complexity, through a comprehensive evaluation of augmentation strategies, class weighting, and hyperparameters. The optimal configuration achieved 94.9%-95.2% cross-validation accuracy, in which…
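The abstract names class weighting as one of the strategies evaluated, and the Herlev class counts it reports (242 normal, 675 abnormal) show why: the abnormal class is almost three times larger. The excerpt does not say which weighting scheme the authors used, so the sketch below is only an illustration of one common choice, inverse-frequency ("balanced") weights. The function name `balanced_class_weights` and the label strings are hypothetical, introduced here for the example.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Return inverse-frequency weights: n_samples / (n_classes * count_c).

    This is the common "balanced" heuristic, assumed here for illustration;
    it is not necessarily the scheme used in the paper.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Class counts taken from the abstract: 242 normal, 675 abnormal (917 total).
labels = ["normal"] * 242 + ["abnormal"] * 675
weights = balanced_class_weights(labels)

for cls, w in sorted(weights.items()):
    print(f"{cls}: {w:.3f}")
```

Under this heuristic the minority (normal) class gets a weight of roughly 1.89 and the majority (abnormal) class roughly 0.68, so each class contributes equally in total to a weighted loss. Such weights would typically be passed to the loss function during training.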
