ROIFormer: Semantic-Aware Region of Interest Transformer for Efficient Self-Supervised Monocular Depth Estimation

Daitao Xing; Jinglin Shen; Chiuman Ho; Anthony Tzes

arXiv:2212.05729·cs.CV·March 7, 2023

TL;DR

ROIFormer introduces a semantic-aware, local adaptive attention mechanism within a Transformer framework to enhance self-supervised monocular depth estimation, achieving state-of-the-art results on the KITTI dataset.

Contribution

The paper proposes a novel local adaptive attention method guided by semantic cues, improving feature fusion and depth estimation efficiency in a Transformer-based model.

Findings

01. Achieves a new state of the art on KITTI for self-supervised monocular depth estimation.

02. Demonstrates faster convergence and more accurate depth predictions.

03. Semantic-aware local attention enhances feature discriminability.

Abstract

The exploration of mutual-benefit cross-domains has shown great potential toward accurate self-supervised depth estimation. In this work, we revisit feature fusion between depth and semantic information and propose an efficient local adaptive attention method for geometric aware representation enhancement. Instead of building global connections or deforming attention across the feature space without restraint, we bound the spatial interaction within a learnable region of interest. In particular, we leverage geometric cues from semantic information to learn local adaptive bounding boxes to guide unsupervised feature aggregation. The local areas preclude most irrelevant reference points from attention space, yielding more selective feature learning and faster convergence. We naturally extend the paradigm into a multi-head and hierarchic way to enable the information distillation in…
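To make the idea of bounding attention within a region of interest concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the per-pixel boxes are already given as an array (in the paper they are learned from semantic cues), samples a fixed `k x k` grid of reference points inside each box, and uses nearest-neighbour lookup instead of a differentiable sampler. The function name `roi_attention`, the box layout `(cx, cy, w, h)`, and the parameter `k` are all hypothetical choices made for this illustration.

```python
import numpy as np

def roi_attention(query, key, value, boxes, k=3):
    """Attend from each query pixel to a k x k grid of points inside its box.

    query, key, value: (H, W, C) arrays.
    boxes: (H, W, 4) array of (cx, cy, w, h) in normalized [0, 1] coordinates
           (assumed given here; the paper learns them from semantic features).
    Returns an (H, W, C) array of aggregated features.
    """
    H, W, C = query.shape

    # Grid offsets in [-0.5, 0.5] that span a box along each axis.
    steps = np.linspace(-0.5, 0.5, k)
    oy, ox = np.meshgrid(steps, steps, indexing="ij")
    ox, oy = ox.reshape(-1), oy.reshape(-1)                  # (k*k,)

    cx, cy, bw, bh = (boxes[..., i, None] for i in range(4))  # each (H, W, 1)

    # Reference point locations in pixel units, clipped to the feature map.
    px = np.clip(cx + ox * bw, 0.0, 1.0) * (W - 1)           # (H, W, k*k)
    py = np.clip(cy + oy * bh, 0.0, 1.0) * (H - 1)
    ix = np.rint(px).astype(int)                             # nearest neighbour
    iy = np.rint(py).astype(int)

    # Gather keys and values only at the reference points inside each box.
    k_s = key[iy, ix]                                        # (H, W, k*k, C)
    v_s = value[iy, ix]

    # Scaled dot-product attention over the k*k local reference points.
    logits = np.einsum("hwc,hwnc->hwn", query, k_s) / np.sqrt(C)
    logits -= logits.max(axis=-1, keepdims=True)             # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("hwn,hwnc->hwc", weights, v_s)
```

Each query compares against only `k*k` keys rather than all `H*W` positions, which is the sense in which a bounded region excludes most irrelevant reference points. A trainable version would need a differentiable sampler (for example bilinear interpolation) so that gradients can reach the box parameters; that part is omitted here.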

Taxonomy

Topics: Advanced Vision and Imaging · Image Processing Techniques and Applications · Optical measurement and interference techniques

Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Layer Normalization · Dropout · Byte Pair Encoding · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Residual Connection