MAMo: Leveraging Memory and Attention for Monocular Video Depth Estimation

Rajeev Yasarla; Hong Cai; Jisoo Jeong; Yunxiao Shi; Risheek Garrepalli; Fatih Porikli

arXiv:2307.14336·cs.CV·January 17, 2025

TL;DR

MAMo is a memory and attention framework that improves monocular video depth estimation by exploiting temporal information, yielding more accurate predictions on KITTI, NYU-Depth V2, and DDAD with lower latency than cost-volume methods.

Contribution

The paper presents a novel memory and attention-based approach that can augment existing single-image depth networks for improved video depth estimation.

Findings

01

Achieves state-of-the-art accuracy on KITTI, NYU-Depth V2, and DDAD benchmarks.

02

Improves depth estimation accuracy with lower latency compared to cost-volume methods.

03

Effectively leverages temporal information through memory and attention mechanisms.

Abstract

We propose MAMo, a novel memory and attention framework for monocular video depth estimation. MAMo can augment any single-image depth estimation network and turn it into a video depth estimation model, enabling it to take advantage of temporal information to predict more accurate depth. In MAMo, we augment the model with memory, which aids depth prediction as the model streams through the video. Specifically, the memory stores learned visual and displacement tokens from previous time instances. This allows the depth network to cross-reference relevant features from the past when predicting depth on the current frame. We introduce a novel scheme to continuously update the memory, optimizing it to keep tokens that correspond with both the past and the present visual information. We adopt an attention-based approach to process memory features, where we first learn the spatio-temporal…
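The idea of cross-referencing past features while streaming through a video can be illustrated with a small NumPy sketch. This is not the authors' implementation: the `TokenMemory` class, the `cross_attend` function, the token shapes, and the residual fusion are all hypothetical, and a plain first-in-first-out buffer stands in for the paper's learned memory update scheme. Only visual tokens are modeled here; displacement tokens are omitted.

```python
import numpy as np
from collections import deque


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attend(queries, memory):
    """Scaled dot-product cross-attention (hypothetical helper).

    queries: (N, D) tokens of the current frame.
    memory:  (M, D) tokens gathered from previous frames.
    Returns the attended features (N, D) and the attention weights (N, M).
    """
    d = queries.shape[-1]
    weights = softmax(queries @ memory.T / np.sqrt(d), axis=-1)
    return weights @ memory, weights


class TokenMemory:
    """FIFO token buffer; an assumed stand-in for a learned memory update."""

    def __init__(self, max_frames=3):
        self.frames = deque(maxlen=max_frames)  # oldest frame is dropped first

    def update(self, tokens):
        self.frames.append(tokens)

    def read(self):
        if not self.frames:
            return None
        return np.concatenate(list(self.frames), axis=0)


rng = np.random.default_rng(0)
memory = TokenMemory(max_frames=3)

for t in range(5):
    # Placeholder for per-frame features from a single-image depth encoder.
    tokens = rng.normal(size=(6, 8))
    past = memory.read()
    if past is not None:
        attended, weights = cross_attend(tokens, past)
        fused = tokens + attended  # residual fusion of present and past
    memory.update(tokens)
```

In this sketch each current-frame token is a query and every stored token serves as both key and value, so `fused` carries information from up to three earlier frames. A real system would feed `fused` to the depth decoder and would decide which tokens to retain rather than evicting strictly by age.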

Taxonomy

Topics: Advanced Vision and Imaging · Image Processing Techniques and Applications · Advanced Image Processing Techniques