BinsFormer: Revisiting Adaptive Bins for Monocular Depth Estimation

Zhenyu Li; Xuyang Wang; Xianming Liu; Junjun Jiang

arXiv:2204.00987·cs.CV·April 5, 2022·72 cites

TL;DR

BinsFormer is a Transformer-based framework for monocular depth estimation that generates adaptive depth bins and decodes features at multiple scales to capture spatial geometry, reporting state-of-the-art results on the KITTI, NYU, and SUN RGB-D datasets.

Contribution

The paper uses a Transformer decoder to generate adaptive depth bins, treating bin generation as a set-to-set prediction problem, and adds a multi-scale decoder structure to improve depth estimation accuracy.

Findings

01. Outperforms existing methods on the KITTI, NYU, and SUN RGB-D datasets.

02. Uses a novel set-to-set prediction approach for bin generation.

03. Incorporates scene understanding queries to enhance depth accuracy.

Abstract

Monocular depth estimation is a fundamental task in computer vision and has drawn increasing attention. Recently, some methods reformulate it as a classification-regression task to boost model performance, where continuous depth is estimated via a linear combination of predicted probability distributions and discrete bins. In this paper, we present a novel framework called BinsFormer, tailored for classification-regression-based depth estimation. It mainly focuses on two crucial components of this task: 1) proper generation of adaptive bins and 2) sufficient interaction between probability distribution and bins predictions. Specifically, we employ the Transformer decoder to generate bins, viewing bin generation as a direct set-to-set prediction problem. We further integrate a multi-scale decoder structure to achieve a comprehensive understanding of spatial geometry information…
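The classification-regression idea in the abstract, where depth is a linear combination of per-pixel probabilities and discrete bins, can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names, the depth range, and the choice to use bin centers derived from normalized bin widths are assumptions made for illustration only.

```python
import math


def bin_centers(widths, d_min, d_max):
    """Convert positive bin widths into bin-center depths inside [d_min, d_max].

    Hypothetical helper: widths are normalized to sum to 1, then each center
    sits at the midpoint of its bin along the depth range.
    """
    total = sum(widths)
    centers = []
    left = 0.0  # normalized position of the current bin's left edge
    for w in widths:
        w_norm = w / total
        centers.append(d_min + (d_max - d_min) * (left + w_norm / 2.0))
        left += w_norm
    return centers


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def pixel_depth(logits, centers):
    """Depth for one pixel: probability-weighted sum of the bin centers."""
    probs = softmax(logits)
    return sum(p * c for p, c in zip(probs, centers))


# Illustrative values only (assumed, not taken from the paper).
centers = bin_centers([1.0, 1.0, 2.0], d_min=0.0, d_max=8.0)
depth = pixel_depth([0.0, 0.0, 0.0], centers)
```

With these illustrative inputs the bins cover [0, 2], [2, 4], and [4, 8], so the centers are 1, 3, and 6; uniform logits give equal probabilities, and the predicted depth is the plain average of the three centers. In an adaptive-bins model the widths would be predicted per image rather than fixed by hand.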

Taxonomy

Topics: Advanced Vision and Imaging · Image Processing Techniques and Applications · Advanced Image and Video Retrieval Techniques

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Absolute Position Encodings · Dropout · Softmax · Layer Normalization · Label Smoothing · Byte Pair Encoding · Position-Wise Feed-Forward Layer