CapsoNet: A CNN-Transformer Ensemble for Multi-Class Abnormality Detection in Video Capsule Endoscopy
Arnav Samal, Ranya Batsyas

TL;DR
CapsoNet is a deep learning ensemble of CNNs and transformers for multi-class abnormality detection in video capsule endoscopy. It reached 86.34% balanced accuracy and a mean AUC-ROC of 0.9908 on the official validation set of the Capsule Vision 2024 Challenge.
Contribution
The paper introduces CapsoNet, a CNN-transformer ensemble designed for VCE abnormality classification. It addresses class imbalance with focal loss, weighted random sampling, and extensive data augmentation.
Findings
Achieved 86.34% balanced accuracy and a mean AUC-ROC of 0.9908 on the official validation set.
Secured 5th place in the Capsule Vision 2024 Challenge.
Demonstrated effectiveness of ensemble and data augmentation techniques.
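Balanced accuracy, the headline metric above, is the mean of per-class recall, so a rare class counts as much as a common one. The following is a minimal sketch of that definition in plain Python; the function name and the toy labels are hypothetical and are not taken from the paper or the challenge data.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy, imbalanced labels (hypothetical): 8 "majority" frames, 2 "minority" frames.
y_true = ["majority"] * 8 + ["minority"] * 2
# The classifier misses one of the two minority frames.
y_pred = ["majority"] * 8 + ["majority", "minority"]

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)

print(plain_accuracy)  # 0.9
print(bal_acc)         # 0.75
```

On this toy data plain accuracy looks high because the majority class dominates, while balanced accuracy exposes the 50% recall on the minority class.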
Abstract
We present CapsoNet, a deep learning framework developed for the Capsule Vision 2024 Challenge, designed to perform multi-class abnormality classification in video capsule endoscopy (VCE) frames. CapsoNet leverages an ensemble of convolutional neural networks (CNNs) and transformer-based architectures to capture both local and global visual features. The model was trained and evaluated on a dataset of over 50,000 annotated frames spanning ten abnormality classes, sourced from three public and one private dataset. To address the challenge of class imbalance, we employed focal loss, weighted random sampling, and extensive data augmentation strategies. All models were fully fine-tuned to maximize performance within the ensemble. CapsoNet achieved a balanced accuracy of 86.34 percent and a mean AUC-ROC of 0.9908 on the official validation set, securing Team Seq2Cure 5th place in the…
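Two of the imbalance techniques named in the abstract, focal loss and weighted random sampling, can be illustrated with a short sketch. This is not the authors' implementation: the focusing parameter `gamma=2.0`, the inverse-frequency weighting scheme, and all function names are assumptions chosen for illustration, and the labels are toy data.

```python
import math
import random
from collections import Counter


def focal_loss(probs, target, gamma=2.0):
    """Focal loss for one sample: -(1 - p_t)^gamma * log(p_t).

    gamma is an assumed value; gamma=0 reduces to plain cross-entropy.
    """
    p_t = probs[target]
    return -((1.0 - p_t) ** gamma) * math.log(p_t)


def inverse_frequency_weights(labels):
    """Per-sample weight 1 / count(class), so every class has equal total weight."""
    counts = Counter(labels)
    return [1.0 / counts[y] for y in labels]


# A confidently correct prediction is down-weighted far more than a wrong one.
easy = focal_loss([0.9, 0.1], target=0)
hard = focal_loss([0.1, 0.9], target=0)

# Toy, imbalanced labels (hypothetical): class 0 is 4x more frequent than class 1.
labels = [0] * 8 + [1] * 2
weights = inverse_frequency_weights(labels)

# Weighted sampling with replacement draws both classes at roughly equal rates.
rng = random.Random(0)
batch = rng.choices(range(len(labels)), weights=weights, k=1000)
minority_share = sum(labels[i] == 1 for i in batch) / len(batch)
```

In a PyTorch training loop the same per-sample weights could be passed to `torch.utils.data.WeightedRandomSampler`; whether the authors used that class is not stated in the abstract.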
Taxonomy
Topics: Gastrointestinal Bleeding Diagnosis and Treatment
