HiMODE: A Hybrid Monocular Omnidirectional Depth Estimation Model
Masum Shah Junayed, Arezoo Sadeghzadeh, Md Baharul Islam, Lai-Kuan Wong, Tarkan Aydin

TL;DR
HiMODE is a novel hybrid CNN-Transformer model for monocular omnidirectional depth estimation, effectively capturing details and reducing distortion with state-of-the-art results on multiple datasets.
Contribution
The paper introduces HiMODE, a hybrid CNN-Transformer architecture with novel modules for improved 360° depth estimation from monocular images.
Findings
Achieves state-of-the-art performance on Stanford3D, Matterport3D, and SunCG datasets.
Effectively captures small object details and reduces distortion.
Demonstrates the importance of each module through ablation studies.
Abstract
Monocular omnidirectional depth estimation is receiving considerable research attention due to its broad applications for sensing 360° surroundings. Existing approaches in this field are limited in recovering small object details and the data lost during ground-truth depth map acquisition. In this paper, a novel monocular omnidirectional depth estimation model, namely HiMODE, is proposed. It is based on a hybrid CNN+Transformer (encoder-decoder) architecture whose modules are efficiently designed to mitigate distortion and computational cost without performance degradation. First, we design a feature pyramid network based on the HNet block to extract high-resolution features near the edges. Performance is further improved by a self- and cross-attention layer and spatial/temporal patches in the Transformer encoder and decoder, respectively. Besides, a spatial…
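The self- and cross-attention mentioned in the abstract can be illustrated with a generic sketch of scaled dot-product attention. This is not HiMODE's actual layer: the function names are hypothetical, the learned query/key/value projections and multiple heads are omitted, and the toy token vectors are made up for illustration. The only point shown is that self-attention draws queries, keys, and values from one sequence, while cross-attention takes queries from one sequence (for example, decoder tokens) and keys/values from another (for example, encoder tokens).

```python
import math


def softmax(row):
    # Numerically stable softmax over a list of floats.
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    # Scaled dot-product attention without learned projections
    # (an assumption made here to keep the sketch short).
    scale = math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Each output vector is a weighted average of the value vectors.
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def self_attention(tokens):
    # Queries, keys, and values all come from the same sequence.
    return attention(tokens, tokens, tokens)


def cross_attention(decoder_tokens, encoder_tokens):
    # Queries come from one sequence, keys and values from the other.
    return attention(decoder_tokens, encoder_tokens, encoder_tokens)


if __name__ == "__main__":
    encoder_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy data
    decoder_tokens = [[0.5, 0.5], [2.0, 0.0]]              # toy data
    print(self_attention(encoder_tokens))
    print(cross_attention(decoder_tokens, encoder_tokens))
```

The output of `cross_attention` has one vector per query token, so its length follows the decoder sequence rather than the encoder sequence.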
Taxonomy
Topics: Advanced Vision and Imaging · Image Processing Techniques and Applications · Optical measurement and interference techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Byte Pair Encoding · Position-Wise Feed-Forward Layer · Dense Connections · Batch Normalization · Layer Normalization · Absolute Position Encodings
