GSDC Transformer: An Efficient and Effective Cue Fusion for Monocular Multi-Frame Depth Estimation

Naiyu Fang, Lemiao Qiu, Shuyou Zhang, Zili Wang, Zheyuan Zhou, Kerui Hu

arXiv:2309.17059 · cs.CV · December 6, 2023

Open Access

TL;DR

The GSDC Transformer introduces a novel, efficient cue fusion method for monocular multi-frame depth estimation, combining deformable and sparse attention to improve accuracy and speed in dynamic and static scenes.
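The TL;DR names sparse attention as one of the two ingredients but does not say which sparsity pattern GSDC uses. As a hedged illustration only, the sketch assumes one generic pattern, a fixed local window over a 1D token sequence, to show how restricting each query to a neighborhood cuts the number of query-key scores from N*N to roughly N*(2*radius + 1). The names `windowed_sparse_attention` and `radius` are hypothetical and do not come from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def windowed_sparse_attention(queries, keys, values, radius):
    """Hypothetical local-window sparse attention (not the GSDC pattern).

    Query i attends only to keys j with |i - j| <= radius. Returns the
    attended outputs and how many query-key scores were computed, which
    grows as N * (2 * radius + 1) instead of N * N for dense attention.
    """
    n = len(queries)
    d = len(queries[0])
    outputs, scored = [], 0
    for i, q in enumerate(queries):
        idx = range(max(0, i - radius), min(n, i + radius + 1))
        # scaled dot-product scores, computed only inside the window
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d) for j in idx]
        scored += len(scores)
        weights = softmax(scores)
        out = [0.0] * len(values[0])
        for w, j in zip(weights, idx):
            for c, v in enumerate(values[j]):
                out[c] += w * v
        outputs.append(out)
    return outputs, scored


tokens = [[float(i), 1.0] for i in range(8)]
out, scored = windowed_sparse_attention(tokens, tokens, tokens, radius=1)
print(scored)  # 22 scores instead of 64 for dense attention over 8 tokens
```

Setting `radius` to N - 1 recovers dense attention, which makes the trade-off between context size and cost explicit.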

Contribution

It proposes a cue fusion approach based on deformable and sparse attention that improves both the efficiency and the accuracy of depth estimation without relying heavily on segmentation.
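For readers unfamiliar with deformable attention, here is a minimal single-head, single-scale sketch in the general style of Deformable DETR; it is not the GSDC authors' implementation. Each query samples the feature map at K offsets around a reference point using bilinear interpolation and mixes the samples with softmax weights. Assumptions: the offsets and logits are passed in directly (in a trained model a linear layer would predict them from the query), value and output projections are omitted, and `bilinear_sample` and `deformable_attention` are hypothetical helper names.

```python
import math


def bilinear_sample(feat, x, y):
    """Sample a C-channel map feat[row][col] at continuous (x, y).

    Neighbors that fall outside the map contribute zero.
    """
    h, w = len(feat), len(feat[0])
    c = len(feat[0][0])
    x0, y0 = math.floor(x), math.floor(y)
    out = [0.0] * c
    for yy, wy in ((y0, 1 - (y - y0)), (y0 + 1, y - y0)):
        for xx, wx in ((x0, 1 - (x - x0)), (x0 + 1, x - x0)):
            if 0 <= yy < h and 0 <= xx < w:
                for k in range(c):
                    out[k] += wy * wx * feat[yy][xx][k]
    return out


def deformable_attention(feat, ref, offsets, logits):
    """Single-head, single-scale deformable attention for one query.

    ref:     (x, y) reference point of the query
    offsets: K sampling offsets (dx, dy); assumed given here
    logits:  K attention logits; assumed given here
    The query reads only K sampled locations instead of all H*W positions.
    """
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    out = [0.0] * len(feat[0][0])
    for a, (dx, dy) in zip(weights, offsets):
        sample = bilinear_sample(feat, ref[0] + dx, ref[1] + dy)
        for k, v in enumerate(sample):
            out[k] += a * v
    return out


feat = [[[0.0], [1.0]], [[2.0], [3.0]]]  # 2x2 map, one channel
print(deformable_attention(feat, (0.0, 0.0), [(0.5, 0.5), (1.0, 1.0)], [0.0, 0.0]))
# [2.25]: the average of the samples 1.5 and 3.0
```

Because the sampling locations are fractional, gradients can flow into the offsets through the bilinear weights, which is what lets a trained model learn where to look.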

Findings

01

Achieves state-of-the-art results on the KITTI dataset.

02

Provides faster cue fusion than existing methods.

03

Handles dynamic scenes effectively by representing scene attributes as super tokens.
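This summary does not describe how GSDC builds its scene-attribute super tokens. One simple construction, used here purely as an assumption for illustration, pools a grid of tokens into non-overlapping blocks and records which tokens fed each super token. A downstream step can then reason over a handful of super tokens instead of every token and still look up the tokens behind each one. `grid_super_tokens` and `cell` are hypothetical names.

```python
def grid_super_tokens(tokens, cell):
    """Pool an H x W grid of C-dim tokens into super tokens.

    Hypothetical construction: each super token is the mean of the tokens
    in one non-overlapping cell x cell block (edge blocks may be smaller).
    Returns (super_tokens, members), where members[s] lists the (row, col)
    positions pooled into super token s.
    """
    h, w = len(tokens), len(tokens[0])
    channels = len(tokens[0][0])
    super_tokens, members = [], []
    for r0 in range(0, h, cell):
        for c0 in range(0, w, cell):
            pos = [(r, col)
                   for r in range(r0, min(r0 + cell, h))
                   for col in range(c0, min(c0 + cell, w))]
            mean = [sum(tokens[r][col][k] for r, col in pos) / len(pos)
                    for k in range(channels)]
            super_tokens.append(mean)
            members.append(pos)
    return super_tokens, members


grid = [[[float(r * 4 + c)] for c in range(4)] for r in range(4)]
st, members = grid_super_tokens(grid, 2)
print(st)          # [[2.5], [4.5], [10.5], [12.5]]
print(members[0])  # [(0, 0), (0, 1), (1, 0), (1, 1)]
```

With 2×2 blocks the 16 tokens shrink to 4 super tokens, so full attention among super tokens would score 16 pairs rather than 256.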

Abstract

Depth estimation provides an alternative approach to perceiving 3D information in autonomous driving. Monocular depth estimation, whether with single-frame or multi-frame inputs, has achieved significant success by learning various types of cues and specializing in either static or dynamic scenes. Recently, fusing these cues has become an attractive topic, with the aim of making the combined cues perform well in both types of scenes. However, adaptive cue fusion relies on attention mechanisms, whose quadratic complexity limits the granularity of cue representation. Additionally, explicit cue fusion depends on precise segmentation, which imposes a heavy burden on mask prediction. To address these issues, we propose the GSDC Transformer, an efficient and effective component for cue fusion in monocular multi-frame depth estimation. We utilize deformable attention to learn cue relationships…
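The abstract's point that quadratic complexity limits the granularity of cue representation can be made concrete by counting query-key interactions on an H×W feature map. The sketch compares dense attention, where every position attends to every position, with a deformable-style scheme where each position reads a fixed number of sampled points K. The feature-map sizes and K = 4 are hypothetical values chosen for illustration, and `attention_pair_counts` is a hypothetical helper.

```python
def attention_pair_counts(height, width, samples):
    """Count query-key interactions for one attention layer on an H x W map.

    dense:      every position attends to every position -> (H*W)^2
    deformable: every position reads `samples` sampled points -> H*W*K
    """
    n = height * width
    return {"dense": n * n, "deformable": n * samples}


coarse = attention_pair_counts(48, 160, 4)
fine = attention_pair_counts(96, 320, 4)
print(coarse)  # {'dense': 58982400, 'deformable': 30720}
print(fine)    # {'dense': 943718400, 'deformable': 122880}
```

Doubling both sides of the map multiplies the dense count by 16 but the deformable count only by 4, so finer cue representations stay affordable when each query samples a fixed number of points.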

Taxonomy

Topics: Advanced Vision and Imaging · Image Processing Techniques and Applications · Advanced Image Processing Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Dense Connections · Linear Layer · Dropout · Byte Pair Encoding · Label Smoothing · Absolute Position Encodings · Adam · Softmax