EndoMatcher: Generalizable Endoscopic Image Matcher via Multi-Domain Pre-training for Robot-Assisted Surgery

Bingyu Yang; Qingyao Tian; Yimeng Geng; Huai Liao; Xinyan Huang; Jiebo Luo; and Hongbin Liu

arXiv:2508.05205 · cs.CV · August 8, 2025

TL;DR

EndoMatcher is a novel endoscopic image matching method that uses multi-domain pre-training and a two-branch Vision Transformer to achieve robust, zero-shot performance across diverse challenging conditions in robot-assisted surgery.

Contribution

The paper introduces EndoMatcher, a generalizable endoscopic image matcher trained on a large multi-domain dataset with a progressive multi-objective strategy to improve robustness.

Findings

01. Increases inlier matches by over 140% on benchmark datasets.

02. Achieves 9.4% higher MDPA on Gastro-Matching.

03. Demonstrates strong zero-shot generalization across unseen organs and conditions.

Abstract

Generalizable dense feature matching in endoscopic images is crucial for robot-assisted tasks, including 3D reconstruction, navigation, and surgical scene understanding. Yet, it remains a challenge due to difficult visual conditions (e.g., weak textures, large viewpoint variations) and a scarcity of annotated data. To address these challenges, we propose EndoMatcher, a generalizable endoscopic image matcher via large-scale, multi-domain data pre-training. To address difficult visual conditions, EndoMatcher employs a two-branch Vision Transformer to extract multi-scale features, enhanced by dual interaction blocks for robust correspondence learning. To overcome data scarcity and improve domain diversity, we construct Endo-Mix6, the first multi-domain dataset for endoscopic matching. Endo-Mix6 consists of approximately 1.2M real and synthetic image pairs across six domains, with…
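To make the term "feature matching" concrete, the sketch below pairs descriptors from two images by mutual nearest neighbor on cosine similarity. This is a generic baseline, not EndoMatcher's method: the abstract does not say how correspondences are extracted from the network's features, and the function name `mutual_nearest_matches`, the `min_similarity` parameter, and the toy descriptors are all hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def _normalize(v: Sequence[float]) -> List[float]:
    """Scale a descriptor to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def mutual_nearest_matches(
    desc_a: Sequence[Sequence[float]],
    desc_b: Sequence[Sequence[float]],
    min_similarity: float = 0.0,
) -> List[Tuple[int, int]]:
    """Return (index_in_a, index_in_b) pairs that are each other's best match.

    Hypothetical illustration only; not the matching step used by EndoMatcher.
    """
    if not desc_a or not desc_b:
        return []

    a = [_normalize(v) for v in desc_a]
    b = [_normalize(v) for v in desc_b]

    # Cosine similarity between every descriptor in A and every descriptor in B.
    sim = [[sum(x * y for x, y in zip(va, vb)) for vb in b] for va in a]

    # Best candidate in B for each A descriptor, and best candidate in A for each B.
    best_b_for_a = [max(range(len(b)), key=row.__getitem__) for row in sim]
    best_a_for_b = [
        max(range(len(a)), key=lambda i, j=j: sim[i][j]) for j in range(len(b))
    ]

    # Keep a pair only if the choice is mutual and similar enough.
    return [
        (i, j)
        for i, j in enumerate(best_b_for_a)
        if best_a_for_b[j] == i and sim[i][j] >= min_similarity
    ]


if __name__ == "__main__":
    # Toy 2-D descriptors (assumed values, for illustration only).
    image_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    image_b = [[0.0, 2.0], [3.0, 0.0]]
    print(mutual_nearest_matches(image_a, image_b))
```

The mutual check discards one-sided matches, which is a common way to reduce false correspondences before a geometric inlier test.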
