REMOTE: Real-time Ego-motion Tracking for Various Endoscopes via Multimodal Visual Feature Learning

Liangjing Shao; Benshuang Chen; Shuting Zhao; Xinrong Chen

arXiv:2501.18124·cs.CV·February 4, 2025

Open Access

TL;DR

This paper introduces a real-time ego-motion tracking framework for endoscopes using multimodal visual features, achieving high accuracy and speed suitable for robotic endoscopy navigation.

Contribution

It proposes a novel multi-modal feature learning network with an attention-based extractor and pose decoder for improved endoscope ego-motion estimation.
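
As a rough illustration of what an attention-based extractor over a frame pair can look like, the sketch below applies a squeeze-and-excitation style channel gate to two frames concatenated along the channel axis. This is a generic technique chosen for illustration, not the paper's actual extractor; the function name `channel_attention`, the weight shapes, and the random inputs are all assumptions.

```python
import numpy as np


def channel_attention(x, w1, w2):
    """Reweight the channels of x (shape C x H x W) with a learned gate.

    Generic squeeze-and-excitation style gate; hypothetical stand-in,
    not the extractor described in the paper.
    """
    squeezed = x.mean(axis=(1, 2))                 # global average pool -> (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)        # bottleneck + ReLU -> (C_r,)
    gate = 1.0 / (1.0 + np.exp(-(w2 @ hidden)))    # sigmoid gate in (0, 1) -> (C,)
    return x * gate[:, None, None], gate


rng = np.random.default_rng(0)
frame_prev = rng.random((3, 8, 8))                 # placeholder RGB frame at time t
frame_next = rng.random((3, 8, 8))                 # placeholder RGB frame at time t+1

# Concatenate the two adjacent frames along the channel axis: (6, 8, 8).
pair = np.concatenate([frame_prev, frame_next], axis=0)

# Randomly initialised weights stand in for trained parameters.
w1 = rng.standard_normal((3, 6))
w2 = rng.standard_normal((6, 3))

out, gate = channel_attention(pair, w1, w2)
```

The gate holds one weight per channel of the concatenated pair, so channels from either frame can be emphasised or suppressed before further feature extraction.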

Findings

01. Outperforms state-of-the-art methods on multiple datasets.
02. Achieves over 30 frames per second inference speed.
03. Demonstrates robustness across various endoscopic scenes.

Abstract

Real-time ego-motion tracking for endoscopes is a significant task for efficient navigation and robotic automation of endoscopy. In this paper, a novel framework is proposed to perform real-time ego-motion tracking for endoscopes. First, a multi-modal visual feature learning network is proposed to perform relative pose prediction, in which the motion feature from the optical flow, the scene features, and the joint feature from two adjacent observations are all extracted for prediction. Because the concatenated image carries more correlation information in its channel dimension, a novel feature extractor is designed based on an attention mechanism to integrate multi-dimensional information from the concatenation of two consecutive frames. To extract a more complete feature representation from the fused features, a novel pose decoder is proposed to predict the pose transformation from the…
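
Relative pose prediction yields one transformation per pair of adjacent frames, and a tracked trajectory is usually obtained by chaining those transformations. The sketch below shows that chaining with 4x4 homogeneous matrices. It assumes each relative pose expresses the next camera frame in the coordinates of the current one; the paper's own convention is not stated here, and the helpers `make_pose` and `accumulate_trajectory` are hypothetical names.

```python
import numpy as np


def make_pose(rotation_z_rad, translation):
    """Build a 4x4 homogeneous transform: rotation about z, then a translation."""
    c, s = np.cos(rotation_z_rad), np.sin(rotation_z_rad)
    T = np.eye(4)
    T[:3, :3] = [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]
    T[:3, 3] = translation
    return T


def accumulate_trajectory(relative_poses, start=None):
    """Chain relative poses into absolute poses.

    Assumed convention: T_world_(k+1) = T_world_k @ T_rel_k.
    """
    current = np.eye(4) if start is None else start.copy()
    trajectory = [current.copy()]
    for T_rel in relative_poses:
        current = current @ T_rel
        trajectory.append(current.copy())
    return trajectory


# Toy input: four identical steps, each moving 1 unit forward along x
# and turning 90 degrees about z. The camera traces a unit square.
steps = [make_pose(np.pi / 2, [1.0, 0.0, 0.0])] * 4
trajectory = accumulate_trajectory(steps)
```

In this toy example the camera returns to its starting pose after four steps. With predicted poses, small per-step errors compound along the chain, which is why per-frame accuracy matters for long sequences.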

Taxonomy

Topics: Augmented Reality Applications

Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings