ViT-2SPN: Vision Transformer-based Dual-Stream Self-Supervised Pretraining Networks for Retinal OCT Classification
Mohammadreza Saraei, Igor Kozak, Eung-Joo Lee

TL;DR
This paper introduces ViT-2SPN, a self-supervised pretraining framework using Vision Transformers for improved retinal OCT classification, addressing data scarcity and privacy issues in medical imaging.
Contribution
The paper presents a novel dual-stream self-supervised pretraining approach with a Vision Transformer backbone, enhancing feature extraction for OCT diagnosis.
Findings
Achieved a mean AUC of 0.93 on OCT classification
Outperformed existing self-supervised methods in accuracy and F1 score
Demonstrated effectiveness with limited labeled data
Abstract
Optical Coherence Tomography (OCT) is a non-invasive imaging modality essential for diagnosing various eye diseases. Despite its clinical significance, developing OCT-based diagnostic tools faces challenges, such as limited public datasets, sparse annotations, and privacy concerns. Although deep learning has made progress in automating OCT analysis, these challenges remain unresolved. To address these limitations, we introduce the Vision Transformer-based Dual-Stream Self-Supervised Pretraining Network (ViT-2SPN), a novel framework designed to enhance feature extraction and improve diagnostic accuracy. ViT-2SPN employs a three-stage workflow: Supervised Pretraining, Self-Supervised Pretraining (SSP), and Supervised Fine-Tuning. The pretraining phase leverages the OCTMNIST dataset (97,477 unlabeled images across four disease classes) with data augmentation to create dual-augmented views.…
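The dual-augmented views mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not list the augmentations used, so the horizontal flip and brightness scaling below are assumptions, as are the names `augment` and `dual_views` and the synthetic 28×28 grayscale placeholder image. The point is only that one unlabeled image yields two independently augmented copies, one for each stream.

```python
import random


def augment(image, rng):
    """Return one randomly augmented copy of a 2D grayscale image (list of rows).

    The two transforms here are illustrative assumptions, not the paper's recipe.
    """
    out = [row[:] for row in image]  # copy so the input image is left untouched
    # Random horizontal flip with probability 0.5.
    if rng.random() < 0.5:
        out = [row[::-1] for row in out]
    # Random brightness scaling, clipped to the valid 8-bit range [0, 255].
    scale = rng.uniform(0.8, 1.2)
    out = [[min(255, max(0, round(p * scale))) for p in row] for row in out]
    return out


def dual_views(image, seed=None):
    """Create two independently augmented views of the same unlabeled image."""
    rng = random.Random(seed)
    return augment(image, rng), augment(image, rng)


# Synthetic placeholder image (hypothetical data, not an OCT scan).
image = [[(r * 28 + c) % 256 for c in range(28)] for r in range(28)]
view_a, view_b = dual_views(image, seed=0)
```

In a self-supervised setup, each view would be passed through its own stream and the pretraining objective would operate on the resulting pair of representations, so no labels are needed at this stage.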
Taxonomy
Topics: Retinal Imaging and Analysis · Brain Tumor Detection and Classification
Methods: Attention Is All You Need · Softmax · Adam · Residual Connection · Dropout · Absolute Position Encodings · Byte Pair Encoding · Linear Layer · Vision Transformer · Multi-Head Attention
