HTMNet: A Hybrid Network with Transformer-Mamba Bottleneck Multimodal Fusion for Transparent and Reflective Objects Depth Completion
Guanghu Xie, Yonglong Zhang, Zhiduo Jiang, Yang Liu, Zongwu Xie, Baoshi Cao, Hong Liu

TL;DR
HTMNet is a hybrid neural network that combines Transformer, CNN, and Mamba architectures to improve depth completion for transparent and reflective objects, enhancing robotic perception.
Contribution
The paper introduces the first application of Mamba architecture in transparent object depth completion, with a novel multimodal fusion module and multi-scale decoder for improved accuracy.
Findings
Achieves state-of-the-art performance on public datasets.
Effectively handles transparent and reflective object depth completion.
Demonstrates the potential of Mamba architecture in this domain.
Abstract
Transparent and reflective objects pose significant challenges for depth sensors, resulting in incomplete depth information that adversely affects downstream robotic perception and manipulation tasks. To address this issue, we propose HTMNet, a novel hybrid model integrating Transformer, CNN, and Mamba architectures. The encoder is based on a dual-branch CNN-Transformer framework, the bottleneck fusion module adopts a Transformer-Mamba architecture, and the decoder is built upon a multi-scale fusion module. We introduce a novel multimodal fusion module grounded in self-attention mechanisms and state space models, marking the first application of the Mamba architecture in the field of transparent object depth completion and revealing its promising potential. Additionally, we design an innovative multi-scale fusion module for the decoder that combines channel attention, spatial attention,…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsImage Processing and 3D Reconstruction · Robotics and Automated Systems · Hand Gesture Recognition Systems
MethodsAttention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Mamba: Linear-Time Sequence Modeling with Selective State Spaces
