HTMNet: A Hybrid Network with Transformer-Mamba Bottleneck Multimodal Fusion for Transparent and Reflective Objects Depth Completion

Guanghu Xie; Yonglong Zhang; Zhiduo Jiang; Yang Liu; Zongwu Xie; Baoshi Cao; Hong Liu

arXiv:2505.20904 · cs.CV · May 29, 2025

Open Access

TL;DR

HTMNet is a hybrid neural network that combines Transformer, CNN, and Mamba architectures to improve depth completion for transparent and reflective objects, enhancing robotic perception.

Contribution

The paper presents what the authors describe as the first application of the Mamba architecture to transparent object depth completion, together with a novel multimodal fusion module and a multi-scale decoder for improved accuracy.

Findings

01 Achieves state-of-the-art performance on public datasets.

02 Effectively handles transparent and reflective object depth completion.

03 Demonstrates the potential of Mamba architecture in this domain.

Abstract

Transparent and reflective objects pose significant challenges for depth sensors, resulting in incomplete depth information that adversely affects downstream robotic perception and manipulation tasks. To address this issue, we propose HTMNet, a novel hybrid model integrating Transformer, CNN, and Mamba architectures. The encoder is based on a dual-branch CNN-Transformer framework, the bottleneck fusion module adopts a Transformer-Mamba architecture, and the decoder is built upon a multi-scale fusion module. We introduce a novel multimodal fusion module grounded in self-attention mechanisms and state space models, marking the first application of the Mamba architecture in the field of transparent object depth completion and revealing its promising potential. Additionally, we design an innovative multi-scale fusion module for the decoder that combines channel attention, spatial attention,…
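
The abstract names two sequence-mixing primitives behind the bottleneck fusion module: self-attention and state space models. The sketch below is a minimal, pure-Python illustration of each, not the authors' implementation. The function names `self_attention` and `ssm_scan`, the scalar parameters `a`, `b`, `c`, and the identity query/key/value projections are all assumptions made for illustration. Mamba additionally makes the state space parameters depend on the input, which this sketch omits.

```python
import math


def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections.

    `tokens` is a list of equal-length float vectors. Every token attends
    to every other token, so the cost grows quadratically with length.
    (Illustrative only: a real layer would use learned projections.)
    """
    d = len(tokens[0])
    out = []
    for q in tokens:
        # Similarity of this query against every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in tokens]
        # Numerically stable softmax over the scores.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)
        weights = [w / z for w in weights]
        # Weighted sum of the value vectors.
        out.append([sum(w * v[i] for w, v in zip(weights, tokens))
                    for i in range(d)])
    return out


def ssm_scan(xs, a=0.9, b=0.1, c=1.0):
    """Scalar linear state space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t,   y_t = c * h_t

    One pass over the sequence, so the cost is linear in its length.
    (Hypothetical fixed parameters; Mamba's selective variant computes
    them from the input at each step.)
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


if __name__ == "__main__":
    # Two toy "tokens" standing in for RGB and depth features (made-up values).
    mixed = self_attention([[1.0, 0.0], [0.0, 1.0]])
    # Scan the first channel of the attention output as a 1-D sequence.
    print(ssm_scan([tok[0] for tok in mixed]))
```

The contrast is the point of the sketch: attention mixes all token pairs at once, while the state space scan carries a running state through the sequence in a single pass. A hybrid block can, in principle, combine both kinds of mixing.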

Taxonomy

Topics: Image Processing and 3D Reconstruction · Robotics and Automated Systems · Hand Gesture Recognition Systems

Methods: Attention Is All You Need · Linear Layer · Layer Normalization · Byte Pair Encoding · Residual Connection · Dense Connections · Softmax · Position-Wise Feed-Forward Layer · Absolute Position Encodings · Mamba: Linear-Time Sequence Modeling with Selective State Spaces