VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection -- after competition results
Bo-Cheng Qiu, Fang-Ying Lin, Ming-Han Sun, Yu-Fan Lin, Chia-Ming Lee, Chih-Chung Hsu

TL;DR
VISTA is a multi-backbone framework for rare-pathology capsule endoscopy event detection that fuses a temporal model and a frame-level visual model using validation-guided weighting and anatomy-aware event decoding. It ranked second in the post-competition evaluation of the RAREVISION task.
Contribution
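The paper introduces VISTA, a multi-backbone, validation-guided framework that integrates a temporal model (EndoFM-LV) and a frame-level spatial model (DINOv3 ViT-L/16) through a Diverse Head Ensemble, Validation-Guided Weighted Fusion, and Anatomy-Aware Temporal Event Decoding for rare-pathology event detection.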
Findings
Achieved a post-competition temporal mAP@0.5 of 0.3726
Achieved a post-competition temporal mAP@0.95 of 0.3431
Team ACVLab ranked second in the post-competition evaluation
Abstract
Capsule endoscopy event detection is challenging because clinically relevant findings are sparse, visually heterogeneous, and evaluated at the event level rather than by frame accuracy. We propose VISTA, a metric-aligned multi-backbone framework for the RAREVISION task. VISTA combines EndoFM-LV for temporal context and DINOv3 ViT-L/16 for frame-level visual semantics, followed by a Diverse Head Ensemble (DHE), Validation-Guided Weighted Fusion (VGWF), and Anatomy-Aware Temporal Event Decoding (ATED). The original official submission achieved hidden-test temporal mAP@0.5 of 0.3530 and mAP@0.95 of 0.3235. After the competition, extending local threshold refinement with a global coarse search improved performance to 0.3726 mAP@0.5 and 0.3431 mAP@0.95, ranking Team ACVLab second in the post-competition evaluation.
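To make the pipeline described in the abstract more concrete, the following is a minimal Python sketch of three generic ingredients: weighted fusion of two per-frame probability tracks, decoding of thresholded frames into events, and a global coarse threshold search followed by local refinement. It is not the authors' implementation. All function names, the convex-combination fusion rule, the grid step sizes, and the run-length decoding are assumptions chosen for illustration, and the anatomy-aware logic of ATED and the head ensemble of DHE are not reproduced.

```python
from typing import Callable, List, Sequence, Tuple

Event = Tuple[int, int]  # (start_frame, end_frame), end inclusive


def fuse(probs_a: Sequence[float], probs_b: Sequence[float], w_a: float) -> List[float]:
    """Convex combination of two per-frame probability tracks.

    Assumption: w_a would be chosen on validation data; here it is an input.
    """
    return [w_a * a + (1.0 - w_a) * b for a, b in zip(probs_a, probs_b)]


def decode_events(probs: Sequence[float], threshold: float, min_len: int = 1) -> List[Event]:
    """Group consecutive frames with probability >= threshold into events."""
    events: List[Event] = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i  # an event opens
        elif p < threshold and start is not None:
            if i - start >= min_len:
                events.append((start, i - 1))  # the event closes
            start = None
    if start is not None and len(probs) - start >= min_len:
        events.append((start, len(probs) - 1))  # event runs to the last frame
    return events


def temporal_iou(a: Event, b: Event) -> float:
    """Intersection over union of two inclusive frame intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union


def coarse_then_local(score: Callable[[float], float],
                      coarse_step: float = 0.1,
                      fine_step: float = 0.01) -> float:
    """Coarse grid over [0, 1], then a fine grid around the best coarse point."""
    n_coarse = round(1.0 / coarse_step)
    coarse = [i * coarse_step for i in range(n_coarse + 1)]
    best = max(coarse, key=score)
    lo = max(0.0, best - coarse_step)
    hi = min(1.0, best + coarse_step)
    n_fine = round((hi - lo) / fine_step)
    fine = [lo + i * fine_step for i in range(n_fine + 1)]
    return max(fine, key=score)
```

In practice the `score` callable would be an event-level validation metric (for example, one built on `temporal_iou` matches at a chosen overlap threshold), so that the selected threshold is aligned with how the task is evaluated rather than with frame accuracy.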
