GSDC Transformer: An Efficient and Effective Cue Fusion for Monocular Multi-Frame Depth Estimation
Naiyu Fang, Lemiao Qiu, Shuyou Zhang, Zili Wang, Zheyuan Zhou, Kerui Hu

TL;DR
The GSDC Transformer introduces a novel, efficient cue fusion method for monocular multi-frame depth estimation, combining deformable and sparse attention to improve accuracy and speed in dynamic and static scenes.
Contribution
It proposes a cue fusion approach based on deformable and sparse attention that improves the efficiency and accuracy of depth estimation without relying heavily on precise segmentation.
Findings
Achieves state-of-the-art results on the KITTI dataset.
Provides faster cue fusion compared to existing methods.
Effectively handles dynamic scenes with scene attribute super tokens.
Abstract
Depth estimation provides an alternative approach for perceiving 3D information in autonomous driving. Monocular depth estimation, whether with single-frame or multi-frame inputs, has achieved significant success by learning various types of cues and specializing in either static or dynamic scenes. Recently, fusing these cues has become an attractive topic, with the aim of enabling the combined cues to perform well in both types of scenes. However, adaptive cue fusion relies on attention mechanisms, whose quadratic complexity limits the granularity of cue representation. Additionally, explicit cue fusion depends on precise segmentation, which imposes a heavy burden on mask prediction. To address these issues, we propose the GSDC Transformer, an efficient and effective component for cue fusion in monocular multi-frame depth estimation. We utilize deformable attention to learn cue relationships…
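The abstract's point about quadratic complexity can be made concrete with a small counting sketch. This is only an illustration of the general scaling argument, not the GSDC Transformer itself: the feature-map size (192×640), the strides, and the eight sampled keys per query are assumed values chosen for the example, and the function names are hypothetical.

```python
# Illustrative sketch only: counts query-key score computations.
# All sizes below are assumptions for the example, not values from the paper.

def token_count(height: int, width: int, stride: int) -> int:
    """Number of tokens when an image is represented on a grid with the given stride."""
    return (height // stride) * (width // stride)


def full_attention_pairs(num_tokens: int) -> int:
    """Dense attention scores every query against every key: N * N pairs."""
    return num_tokens * num_tokens


def sampled_attention_pairs(num_tokens: int, samples_per_query: int) -> int:
    """Attention that looks at a fixed number of keys per query: N * K pairs.

    Deformable-style attention falls in this family, since each query
    attends to a small, fixed set of sampled locations.
    """
    return num_tokens * samples_per_query


if __name__ == "__main__":
    height, width = 192, 640   # assumed input resolution
    samples_per_query = 8      # assumed number of sampled keys per query
    for stride in (8, 4):      # finer stride = finer cue granularity
        n = token_count(height, width, stride)
        dense = full_attention_pairs(n)
        sparse = sampled_attention_pairs(n, samples_per_query)
        print(f"stride={stride}: tokens={n}, dense={dense}, sampled={sparse}")
```

Under these assumed sizes, halving the stride quadruples the token count, so the dense cost grows sixteenfold while the fixed-sample cost grows only fourfold. That gap is the reason dense attention restricts how finely cues can be represented.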
Taxonomy
Topics: Advanced Vision and Imaging · Image Processing Techniques and Applications · Advanced Image Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Linear Layer · Dropout · Byte Pair Encoding · Label Smoothing · Absolute Position Encodings · Adam · Softmax
