Divide-then-Diagnose: Weaving Clinician-Inspired Contexts for Ultra-Long Capsule Endoscopy Videos
Bowen Liu, Li Yang, Shanshan Song, Mingyu Tang, Zhifang Gao, Qifeng Chen, Yangqiu Song, Huimin Chen, and Xiaomeng Li

TL;DR
This paper introduces a new task and dataset for diagnosis-driven capsule endoscopy video summarization, emphasizing the importance of contextual reasoning for extracting clinically relevant evidence from ultra-long videos.
Contribution
It presents VideoCAP, a novel dataset with diagnosis annotations, and DiCE, a framework inspired by clinical workflows that improves video summarization accuracy.
Findings
DiCE outperforms existing methods in clinical reliability and conciseness.
VideoCAP provides realistic supervision for evidence extraction and diagnosis.
Contextual reasoning enhances ultra-long video summarization in medical imaging.
Abstract
Capsule endoscopy (CE) enables non-invasive gastrointestinal screening, but current CE research remains largely limited to frame-level classification and detection, leaving video-level analysis underexplored. To bridge this gap, we introduce and formally define a new task, diagnosis-driven CE video summarization, which requires extracting key evidence frames that cover clinically meaningful findings and making accurate diagnoses from those evidence frames. This setting is challenging because diagnostically relevant events are extremely sparse and can be overwhelmed by tens of thousands of redundant normal frames, while individual observations are often ambiguous due to motion blur, debris, specular highlights, and rapid viewpoint changes. To facilitate research in this direction, we introduce VideoCAP, the first CE dataset with diagnosis-driven annotations derived from real clinical…
