ROIFormer: Semantic-Aware Region of Interest Transformer for Efficient Self-Supervised Monocular Depth Estimation
Daitao Xing, Jinglin Shen, Chiuman Ho, Anthony Tzes

TL;DR
ROIFormer introduces a semantic-aware, local adaptive attention mechanism within a Transformer framework to enhance self-supervised monocular depth estimation, achieving state-of-the-art results on the KITTI dataset.
Contribution
The paper proposes a novel local adaptive attention method guided by semantic cues, improving feature fusion and depth estimation efficiency in a Transformer-based model.
Findings
Achieves a new state of the art on KITTI for self-supervised monocular depth estimation.
Demonstrates faster convergence and more accurate depth predictions.
Semantic-aware local attention improves feature discriminability.
Abstract
The exploration of mutual-benefit cross-domains has shown great potential toward accurate self-supervised depth estimation. In this work, we revisit feature fusion between depth and semantic information and propose an efficient local adaptive attention method for geometric aware representation enhancement. Instead of building global connections or deforming attention across the feature space without restraint, we bound the spatial interaction within a learnable region of interest. In particular, we leverage geometric cues from semantic information to learn local adaptive bounding boxes to guide unsupervised feature aggregation. The local areas preclude most irrelevant reference points from attention space, yielding more selective feature learning and faster convergence. We naturally extend the paradigm into a multi-head and hierarchic way to enable the information distillation in…
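The core idea in the abstract, restricting attention to reference points inside a per-query region of interest, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes the bounding box (cx, cy, w, h) is already given, whereas the paper learns it from semantic cues. It also assumes a fixed grid of reference points, nearest-neighbour sampling, a single head, and keys that double as values. All function names here are hypothetical.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def roi_reference_points(box, grid=3):
    """Place grid x grid points evenly inside box = (cx, cy, w, h).

    Assumption: in the paper the box is predicted per query from
    semantic features; here it is simply passed in.
    """
    cx, cy, w, h = box
    pts = []
    for i in range(grid):
        for j in range(grid):
            oy = (i + 0.5) / grid - 0.5  # vertical offset in (-0.5, 0.5)
            ox = (j + 0.5) / grid - 0.5  # horizontal offset in (-0.5, 0.5)
            pts.append((cx + ox * w, cy + oy * h))
    return pts

def sample_nearest(feat, x, y):
    """Nearest-neighbour lookup in feat[row][col], clamped to the map."""
    height, width = len(feat), len(feat[0])
    xi = min(max(int(math.floor(x + 0.5)), 0), width - 1)
    yi = min(max(int(math.floor(y + 0.5)), 0), height - 1)
    return feat[yi][xi]

def roi_attention(query, feat, box, grid=3):
    """Scaled dot-product attention restricted to points inside the box.

    Features outside the box never enter the softmax, which is the
    'bounded spatial interaction' described in the abstract.
    """
    pts = roi_reference_points(box, grid)
    keys = [sample_nearest(feat, x, y) for x, y in pts]
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    weights = softmax(scores)
    out = [sum(w * key[c] for w, key in zip(weights, keys))
           for c in range(d)]
    return out, weights, pts

# Toy 4x4 feature map whose 2-D feature at (x, y) is [x, y].
feat = [[[float(x), float(y)] for x in range(4)] for y in range(4)]
out, weights, pts = roi_attention([1.0, 0.0], feat, (1.0, 1.0, 2.0, 2.0), grid=2)
print(out, weights)
```

Because only grid × grid points per query take part in the softmax, the cost per query is independent of the feature-map size, which is one plausible reading of why bounding the attention region makes the method more selective and cheaper than global attention.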
Taxonomy
Topics: Advanced Vision and Imaging · Image Processing Techniques and Applications · Optical measurement and interference techniques
Methods: Multi-Head Attention · Attention Is All You Need · Label Smoothing · Layer Normalization · Dropout · Byte Pair Encoding · Linear Layer · Dense Connections · Position-Wise Feed-Forward Layer · Residual Connection
