VISTA: Validation-Guided Integration of Spatial and Temporal Foundation Models with Anatomical Decoding for Rare-Pathology VCE Event Detection -- after competition results

Bo-Cheng Qiu; Fang-Ying Lin; Ming-Han Sun; Yu-Fan Lin; Chia-Ming Lee; Chih-Chung Hsu

arXiv:2605.22096 · cs.CV · May 22, 2026

TL;DR

VISTA is a framework for rare-pathology capsule endoscopy event detection that combines a temporal and a frame-level foundation model with validation-guided fusion and anatomy-aware event decoding. It ranked second in the post-competition evaluation of the RAREVISION task.

Contribution

The paper introduces VISTA, a multi-backbone, validation-guided framework that pairs a temporal foundation model (EndoFM-LV) with a frame-level one (DINOv3 ViT-L/16) and decodes their fused predictions into events for rare-pathology detection.

Findings

01

Achieved a post-competition temporal mAP@0.5 of 0.3726

02

Achieved a post-competition temporal mAP@0.95 of 0.3431

03

Ranked second in the post-competition evaluation

Abstract

Capsule endoscopy event detection is challenging because clinically relevant findings are sparse, visually heterogeneous, and evaluated at the event level rather than by frame accuracy. We propose VISTA, a metric-aligned multi-backbone framework for the RAREVISION task. VISTA combines EndoFM-LV for temporal context and DINOv3 ViT-L/16 for frame-level visual semantics, followed by a Diverse Head Ensemble (DHE), Validation-Guided Weighted Fusion (VGWF), and Anatomy-Aware Temporal Event Decoding (ATED). The original official submission achieved hidden-test temporal mAP@0.5 of 0.3530 and mAP@0.95 of 0.3235. After the competition, extending local threshold refinement with a global coarse search improved performance to 0.3726 mAP@0.5 and 0.3431 mAP@0.95, ranking Team ACVLab second in the post-competition evaluation.
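The abstract credits the post-competition gain to adding a global coarse search before local threshold refinement. The sketch below illustrates that general idea only; it is not the authors' implementation. It assumes the threshold acts on per-frame probabilities, that events are runs of consecutive above-threshold frames, and that a simple event-level F1 at a fixed temporal IoU can stand in for the challenge's temporal mAP. All function names, step sizes, and the toy data in the usage note are hypothetical.

```python
from typing import Callable, List, Tuple

Event = Tuple[int, int]  # [start, end) frame indices


def decode_events(probs: List[float], threshold: float) -> List[Event]:
    """Group consecutive frames with prob >= threshold into [start, end) events."""
    events: List[Event] = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold and start is None:
            start = i
        elif p < threshold and start is not None:
            events.append((start, i))
            start = None
    if start is not None:
        events.append((start, len(probs)))
    return events


def temporal_iou(a: Event, b: Event) -> float:
    """Intersection over union of two frame intervals."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


def event_f1(pred: List[Event], truth: List[Event], iou_thr: float = 0.5) -> float:
    """Event-level F1 with greedy one-to-one matching (assumed stand-in for mAP)."""
    unmatched = list(truth)
    tp = 0
    for p in pred:
        best = max(unmatched, key=lambda t: temporal_iou(p, t), default=None)
        if best is not None and temporal_iou(p, best) >= iou_thr:
            tp += 1
            unmatched.remove(best)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(truth)
    return 2 * precision * recall / (precision + recall)


def coarse_then_local_search(
    score: Callable[[float], float],
    coarse_step: float = 0.1,
    fine_step: float = 0.01,
) -> Tuple[float, float]:
    """Pick a threshold in (0, 1): coarse global grid, then a fine local grid."""
    # Global coarse pass over the whole range.
    n_coarse = int(round(1 / coarse_step))
    coarse = [round(k * coarse_step, 10) for k in range(1, n_coarse)]
    best_t = max(coarse, key=score)
    # Local refinement within one coarse step of the coarse winner.
    n_fine = int(round(coarse_step / fine_step))
    fine = [round(best_t + k * fine_step, 10) for k in range(-n_fine, n_fine + 1)]
    fine = [t for t in fine if 0.0 < t < 1.0]
    best_t = max(fine, key=score)
    return best_t, score(best_t)
```

On toy validation data one might call it as `coarse_then_local_search(lambda t: event_f1(decode_events(val_probs, t), val_truth))`, where `val_probs` and `val_truth` are placeholder names. The coarse pass guards against a local refinement that starts in a poor region, which is the intuition the abstract suggests; the actual search ranges and scoring used by the authors are not given on this page.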
