Multi-modal On-Device Learning for Monocular Depth Estimation on Ultra-low-power MCUs

Davide Nadalini; Manuele Rusci; Elia Cereda; Luca Benini; Francesco Conti; Daniele Palossi

arXiv:2512.00086 · cs.CV · December 23, 2025

TL;DR

This paper introduces a multi-modal on-device learning method for monocular depth estimation on ultra-low-power IoT devices, enabling in-field adaptation to new environments with minimal energy and memory use.

Contribution

The paper presents a novel on-device training scheme with a memory-efficient sparse update, allowing a depth estimation model to be adapted accurately on the IoT hardware itself.

Findings

01

Limits the accuracy drop to 2% on the KITTI dataset and 1.5% on the NYUv2 dataset.

02

Reduces RMSE from 4.9 m to 0.6 m in 17.8 minutes.

03

Uses only 3,000 self-labeled samples for effective in-field adaptation.

Abstract

Monocular depth estimation (MDE) plays a crucial role in enabling spatially-aware applications on Ultra-low-power (ULP) Internet-of-Things (IoT) platforms. However, the limited number of parameters of Deep Neural Networks for the MDE task, designed for IoT nodes, results in severe accuracy drops when the sensor data observed in the field shifts significantly from the training dataset. To address this domain shift problem, we present a multi-modal On-Device Learning (ODL) technique, deployed on an IoT device integrating a Greenwaves GAP9 MicroController Unit (MCU), an 80 mW monocular camera and an $8 \times 8$ pixel depth sensor, consuming $\approx$ 300 mW. In normal operation, this setup feeds a tiny 107k-parameter $\mu$PyD-Net model with monocular images for inference. The depth sensor, usually deactivated to minimize energy consumption, is only activated alongside the camera to collect…
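The abstract is cut off before it says how the depth sensor readings are used, so the following is only an illustrative sketch, not the paper's pipeline. It assumes that a dense depth prediction could be compared against an 8 x 8 label map by average-pooling the prediction down to the same grid and computing the RMSE in metres. The function names `pool_to_grid` and `rmse_vs_coarse_labels`, the 48 x 48 prediction size, and the use of `None` to mark an invalid zone are all hypothetical choices made for this example.

```python
import math


def pool_to_grid(depth, grid=8):
    """Average-pool a dense depth map (list of rows) down to grid x grid zones."""
    h, w = len(depth), len(depth[0])
    if h % grid or w % grid:
        raise ValueError("height and width must be multiples of the grid size")
    bh, bw = h // grid, w // grid  # size of the block covered by one zone
    return [
        [
            sum(
                depth[r][c]
                for r in range(i * bh, (i + 1) * bh)
                for c in range(j * bw, (j + 1) * bw)
            ) / (bh * bw)
            for j in range(grid)
        ]
        for i in range(grid)
    ]


def rmse_vs_coarse_labels(pred, labels):
    """RMSE (metres) between the pooled prediction and a square coarse label map.

    Zones whose label is None (assumed here to mark an invalid reading)
    are skipped.
    """
    coarse = pool_to_grid(pred, grid=len(labels))
    squared_errors = [
        (coarse[i][j] - label) ** 2
        for i, row in enumerate(labels)
        for j, label in enumerate(row)
        if label is not None
    ]
    if not squared_errors:
        raise ValueError("no valid label zones")
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Toy data: every zone is labeled 2.0 m, the prediction is uniformly 2.5 m.
labels = [[2.0] * 8 for _ in range(8)]
labels[0][0] = None  # one invalid zone, ignored by the metric
pred = [[2.5] * 48 for _ in range(48)]
error = rmse_vs_coarse_labels(pred, labels)
print(f"RMSE: {error:.2f} m")
```

Pooling the prediction down, rather than upsampling the labels, keeps the comparison at the resolution the sensor actually measures; whether the paper does the same cannot be determined from the truncated abstract.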

Taxonomy

Topics: Advanced Vision and Imaging · Advanced Optical Sensing Technologies · Robotics and Sensor-Based Localization