FutureDepth: Learning to Predict the Future Improves Video Depth Estimation
Rajeev Yasarla, Manish Kumar Singh, Hong Cai, Yunxiao Shi, Jisoo Jeong, Yinhao Zhu, Shizhong Han, Risheek Garrepalli, Fatih Porikli

TL;DR
FutureDepth is a video depth estimation approach in which the model is trained to predict future multi-frame features, so that it implicitly picks up multi-frame and motion cues. The authors report state-of-the-art accuracy on NYUDv2, KITTI, DDAD, and Sintel.
Contribution
The paper proposes FutureDepth, a new method that incorporates future prediction and multi-frame correspondence learning to enhance video depth estimation performance.
Findings
Significantly outperforms existing methods on NYUDv2, KITTI, DDAD, and Sintel benchmarks.
Achieves state-of-the-art accuracy in video depth estimation.
Maintains efficiency comparable to monocular models.
Abstract
In this paper, we propose a novel video depth estimation approach, FutureDepth, which enables the model to implicitly leverage multi-frame and motion cues to improve depth estimation by making it learn to predict the future at training. More specifically, we propose a future prediction network, F-Net, which takes the features of multiple consecutive frames and is trained to predict multi-frame features one time step ahead iteratively. In this way, F-Net learns the underlying motion and correspondence information, and we incorporate its features into the depth decoding process. Additionally, to enrich the learning of multiframe correspondence cues, we further leverage a reconstruction network, R-Net, which is trained via adaptively masked auto-encoding of multiframe feature volumes. At inference time, both F-Net and R-Net are used to produce queries to work with the depth decoder, as…
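The iterative one-step-ahead prediction described in the abstract can be illustrated with a small sketch. This is not the authors' F-Net: the names `rollout`, `linear_extrapolation`, and `future_loss` are hypothetical, the predictor is a toy constant-velocity rule standing in for a learned network, and the mean-squared-error objective is an assumption, since the abstract does not state the loss. Each frame's features are represented as a flat list of floats for simplicity.

```python
from typing import Callable, List

Feature = List[float]  # stand-in for one frame's (flattened) feature map


def rollout(window: List[Feature],
            predict_next: Callable[[List[Feature]], Feature],
            steps: int) -> List[Feature]:
    """Predict `steps` future features, one time step at a time.

    After each prediction the window slides forward by one frame, so the
    next prediction is conditioned on the previous one (iterative rollout).
    """
    window = [list(f) for f in window]  # copy so the caller's data is untouched
    predictions = []
    for _ in range(steps):
        nxt = predict_next(window)
        predictions.append(nxt)
        window = window[1:] + [nxt]  # drop the oldest frame, append the new one
    return predictions


def linear_extrapolation(window: List[Feature]) -> Feature:
    """Toy predictor (assumption): constant velocity in feature space."""
    prev, last = window[-2], window[-1]
    return [2 * b - a for a, b in zip(prev, last)]


def future_loss(predictions: List[Feature], targets: List[Feature]) -> float:
    """Mean squared error between predicted and actual future features.

    MSE is an assumed choice; the abstract does not specify the objective.
    """
    total, count = 0.0, 0
    for pred, target in zip(predictions, targets):
        for a, b in zip(pred, target):
            total += (a - b) ** 2
            count += 1
    return total / count


# Three observed frames whose features change linearly over time.
observed = [[0.0, 1.0], [1.0, 3.0], [2.0, 5.0]]
predicted = rollout(observed, linear_extrapolation, steps=2)
```

In a real training setup the toy predictor would be replaced by a learned network, and the loss would compare its rollout against features extracted from the actual future frames, which are available at training time.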
Taxonomy
Topics: Advanced Vision and Imaging · Generative Adversarial Networks and Image Synthesis · Advanced Image and Video Retrieval Techniques
