EndoMatcher: Generalizable Endoscopic Image Matcher via Multi-Domain Pre-training for Robot-Assisted Surgery
Bingyu Yang, Qingyao Tian, Yimeng Geng, Huai Liao, Xinyan Huang, Jiebo Luo, and Hongbin Liu

TL;DR
EndoMatcher is an endoscopic image matcher that combines large-scale multi-domain pre-training with a two-branch Vision Transformer to achieve robust zero-shot matching under challenging visual conditions in robot-assisted surgery.
Contribution
The paper introduces EndoMatcher, a generalizable endoscopic image matcher trained on a large multi-domain dataset (Endo-Mix6) with a progressive multi-objective training strategy to improve robustness.
Findings
Increases inlier matches by over 140% on benchmark datasets.
Achieves 9.4% higher MDPA on Gastro-Matching.
Demonstrates strong zero-shot generalization across unseen organs and conditions.
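To make the inlier figure concrete: a relative increase of 140% means the improved method yields 2.4 times as many inlier matches as the baseline, not 1.4 times. The sketch below works through that arithmetic. The inlier counts are hypothetical values chosen only for illustration and do not come from the paper, and `relative_increase` is an illustrative helper, not part of any EndoMatcher code.

```python
def relative_increase(baseline: float, improved: float) -> float:
    """Return the percentage increase of `improved` over `baseline`."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (improved - baseline) * 100.0 / baseline


# Hypothetical inlier counts (assumed for illustration, not reported results).
baseline_inliers = 100
improved_inliers = 240

gain = relative_increase(baseline_inliers, improved_inliers)
ratio = improved_inliers / baseline_inliers
print(f"{gain:.1f}% more inlier matches ({ratio:.1f}x the baseline)")
```

Read this way, "over 140%" describes more than a doubling of the number of inlier matches relative to the compared method.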
Abstract
Generalizable dense feature matching in endoscopic images is crucial for robot-assisted tasks, including 3D reconstruction, navigation, and surgical scene understanding. Yet, it remains a challenge due to difficult visual conditions (e.g., weak textures, large viewpoint variations) and a scarcity of annotated data. To address these challenges, we propose EndoMatcher, a generalizable endoscopic image matcher via large-scale, multi-domain data pre-training. To address difficult visual conditions, EndoMatcher employs a two-branch Vision Transformer to extract multi-scale features, enhanced by dual interaction blocks for robust correspondence learning. To overcome data scarcity and improve domain diversity, we construct Endo-Mix6, the first multi-domain dataset for endoscopic matching. Endo-Mix6 consists of approximately 1.2M real and synthetic image pairs across six domains, with…
