Forecasting of depth and ego-motion with transformers and self-supervision
Houssem Boulahbal, Adrian Voicila, Andrew Comport

TL;DR
This paper introduces a self-supervised method combining CNNs and transformers to forecast depth and ego-motion from raw image sequences, achieving competitive results without requiring annotated data.
Contribution
It proposes a novel architecture that leverages both convolutional and transformer modules for self-supervised depth and ego-motion forecasting from raw images.
Findings
Performs well on the KITTI benchmark
Achieves results comparable to supervised methods
Uses only raw images without annotations
Abstract
This paper addresses the problem of end-to-end self-supervised forecasting of depth and ego-motion. Given a sequence of raw images, the aim is to forecast both the geometry and ego-motion using a self-supervised photometric loss. The architecture is designed using both convolution and transformer modules. This leverages the benefits of both: the inductive bias of CNNs and the multi-head attention of transformers, yielding a rich spatio-temporal representation that enables accurate depth forecasting. Prior work attempts to solve this problem using multi-modal input/output with supervised ground-truth data, which is impractical because it requires a large annotated dataset. In contrast to prior methods, this paper forecasts depth and ego-motion using only raw images as input, trained with self-supervision. The approach performs well on the KITTI dataset benchmark with several…
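The abstract names a self-supervised photometric loss, but this summary does not give its exact form. As a rough illustration only, the sketch below assumes a plain L1 photometric error between the observed future frame and a frame synthesized from the forecast depth and ego-motion. The view-synthesis (warping) step is omitted, and the names `photometric_l1_loss` and `valid_mask` are hypothetical, not taken from the paper.

```python
import numpy as np

def photometric_l1_loss(target, reconstructed, valid_mask=None):
    """Mean absolute intensity difference between two same-shaped images.

    Assumption: a plain L1 error; the paper's actual loss may differ.
    target        : observed frame, shape (H, W) or (H, W, C)
    reconstructed : frame synthesized from forecast depth and ego-motion
    valid_mask    : optional boolean (H, W) array; True marks pixels to keep
    """
    target = np.asarray(target, dtype=np.float64)
    reconstructed = np.asarray(reconstructed, dtype=np.float64)
    if target.shape != reconstructed.shape:
        raise ValueError("images must have the same shape")

    abs_err = np.abs(target - reconstructed)
    if valid_mask is None:
        return float(abs_err.mean())

    mask = np.asarray(valid_mask, dtype=bool)
    if not mask.any():
        # No valid pixels: define the loss as zero rather than dividing by zero.
        return 0.0
    return float(abs_err[mask].mean())
```

No ground-truth depth or pose appears in this signal: the only supervision is the image itself, which is what allows training on raw sequences without annotations.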
Taxonomy
Topics: Advanced Vision and Imaging · Optical measurement and interference techniques · Image Processing Techniques and Applications
Methods: Softmax · Linear Layer · Convolution
