REMOTE: Real-time Ego-motion Tracking for Various Endoscopes via Multimodal Visual Feature Learning
Liangjing Shao, Benshuang Chen, Shuting Zhao, Xinrong Chen

TL;DR
This paper introduces a real-time ego-motion tracking framework for endoscopes using multimodal visual features, achieving high accuracy and speed suitable for robotic endoscopy navigation.
Contribution
The paper proposes a multimodal visual feature learning network, with an attention-based feature extractor and a pose decoder, for improved endoscope ego-motion estimation.
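To make the idea of an attention-based extractor over two frames more concrete, here is a minimal sketch and not the paper's architecture. Two RGB frames are concatenated along the channel axis (giving 6 channels), each channel is summarized by its global mean, and a softmax over those summaries re-weights the channels. The function names `concat_channels` and `channel_attention`, and the use of an unlearned softmax over channel means, are illustrative assumptions; the actual extractor uses learned weights that this summary does not describe.

```python
import math

def concat_channels(frame_a, frame_b):
    """Stack two frames (each a list of 2D channels) along the channel axis."""
    return frame_a + frame_b

def channel_attention(channels):
    """Re-weight channels by a softmax over their global means (hypothetical, unlearned)."""
    # Squeeze: one scalar per channel (global average pooling).
    means = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in channels]
    # Softmax over the channel descriptors, shifted by the max for numerical stability.
    peak = max(means)
    exps = [math.exp(m - peak) for m in means]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Excite: scale every pixel of a channel by that channel's weight.
    scaled = [[[w * px for px in row] for row in ch] for ch, w in zip(channels, weights)]
    return scaled, weights

# Two tiny 2x2 "RGB" frames with constant channels, purely for illustration.
frame_t = [[[v, v], [v, v]] for v in (0.1, 0.2, 0.3)]
frame_t1 = [[[v, v], [v, v]] for v in (0.2, 0.3, 0.4)]
stacked = concat_channels(frame_t, frame_t1)      # 6 channels
attended, weights = channel_attention(stacked)
```

In practice this would be written with a tensor library and trainable layers; plain lists are used here only to keep the example dependency-free.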
Findings
Outperforms state-of-the-art methods on multiple datasets.
Achieves over 30 frames per second inference speed.
Demonstrates robustness across various endoscopic scenes.
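The 30 frames-per-second finding refers to inference throughput. One common way to measure such a figure is sketched below. `dummy_pose_model` is a hypothetical stand-in, not the paper's network, and the warm-up and run counts are arbitrary choices.

```python
import time

def measure_fps(infer, sample, warmup=5, runs=50):
    """Return the average number of `infer(sample)` calls per second over `runs` timed calls."""
    for _ in range(warmup):            # warm-up calls are excluded from timing
        infer(sample)
    start = time.perf_counter()
    for _ in range(runs):
        infer(sample)
    elapsed = time.perf_counter() - start
    return runs / elapsed

def dummy_pose_model(frame_pair):
    """Hypothetical placeholder: returns a zero 6-DoF pose after trivial work."""
    _ = sum(frame_pair[0]) + sum(frame_pair[1])
    return [0.0] * 6

fps = measure_fps(dummy_pose_model, ([0.0] * 1000, [0.0] * 1000))
is_real_time = fps >= 30.0             # threshold taken from the finding above
```

When timing a model that runs on a GPU, the device usually has to be synchronized before each timer read, otherwise asynchronous kernel launches make the measured time too short.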
Abstract
Real-time ego-motion tracking for endoscopes is a significant task for efficient navigation and robotic automation of endoscopy. In this paper, a novel framework is proposed to perform real-time ego-motion tracking for endoscopes. First, a multimodal visual feature learning network is proposed to perform relative pose prediction, in which the motion feature from the optical flow, the scene features, and the joint feature from two adjacent observations are all extracted for prediction. Because the concatenated image carries more correlation information in its channel dimension, a novel feature extractor is designed based on an attention mechanism to integrate multi-dimensional information from the concatenation of two consecutive frames. To extract a more complete feature representation from the fused features, a novel pose decoder is proposed to predict the pose transformation from the…
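Relative pose prediction gives the transformation between two adjacent frames, so tracking ego-motion amounts to chaining those transformations. The sketch below assumes each relative pose is a 4x4 homogeneous matrix that maps coordinates of frame t+1 into frame t, and restricts rotation to the z axis for brevity. Both are conventions chosen here for illustration, since the excerpt does not state the paper's pose parameterization, and the helper names are hypothetical.

```python
import math

def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)] for i in range(4)]

def relative_pose(yaw, tx, ty, tz):
    """Build a 4x4 transform from a rotation about z by `yaw` radians and a translation."""
    c, s = math.cos(yaw), math.sin(yaw)
    return [[c, -s, 0.0, tx],
            [s, c, 0.0, ty],
            [0.0, 0.0, 1.0, tz],
            [0.0, 0.0, 0.0, 1.0]]

def accumulate(relative_poses):
    """Chain per-frame relative poses into absolute camera poses in the first frame."""
    pose = relative_pose(0.0, 0.0, 0.0, 0.0)   # identity: the first frame is the origin
    trajectory = [pose]
    for rel in relative_poses:
        # Pose of frame t+1 in frame 0 = (pose of frame t in frame 0) x (relative pose).
        pose = matmul4(pose, rel)
        trajectory.append(pose)
    return trajectory

# Four identical steps: advance 1 unit along the current x axis and turn 90 degrees,
# which traces a square and should end back at the starting position.
steps = [relative_pose(math.pi / 2, 1.0, 0.0, 0.0)] * 4
trajectory = accumulate(steps)
final_position = [trajectory[-1][i][3] for i in range(3)]
```

Because every absolute pose is a product of all earlier predictions, small per-frame errors accumulate along the trajectory, which is one reason the accuracy of each relative pose matters.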
Taxonomy
Topics: Augmented Reality Applications
Methods: Softmax · Attention Is All You Need · SPEED: Separable Pyramidal Pooling EncodEr-Decoder for Real-Time Monocular Depth Estimation on Low-Resource Settings
