BREATH-VL: Vision-Language-Guided 6-DoF Bronchoscopy Localization via Semantic-Geometric Fusion
Qingyao Tian, Bingyu Yang, Huai Liao, Xinyan Huang, Junyong Li, Dong Yi, Hongbin Liu

TL;DR
This paper introduces BREATH-VL, a hybrid vision-language framework for accurate 6-DoF bronchoscopy localization, leveraging a new in-vivo dataset and semantic-geometric fusion to improve accuracy and robustness in complex airway navigation.
Contribution
The paper presents BREATH, the largest in-vivo endoscopic localization dataset to date, and BREATH-VL, a hybrid framework that combines vision-language cues with geometric registration for enhanced 6-DoF pose estimation.
Findings
Reduces translational error by 25.5% compared to state-of-the-art methods.
Demonstrates robust semantic localization in challenging surgical scenes.
Achieves competitive computational latency with improved accuracy.
Abstract
Vision-language models (VLMs) have recently shown remarkable performance in navigation and localization tasks by leveraging large-scale pretraining for semantic understanding. However, applying VLMs to 6-DoF endoscopic camera localization presents several challenges: 1) the lack of large-scale, high-quality, densely annotated, and localization-oriented vision-language datasets in real-world medical settings; 2) limited capability for fine-grained pose regression; and 3) high computational latency when extracting temporal features from past frames. To address these issues, we first construct the BREATH dataset, the largest in-vivo endoscopic localization dataset to date, collected in the complex human airway. Building on this dataset, we propose BREATH-VL, a hybrid framework that integrates semantic cues from VLMs with geometric information from vision-based registration methods for accurate…
Taxonomy
Topics: Multimodal Machine Learning Applications · Surgical Simulation and Training · Advanced Neural Network Applications
