ER-LoRA: Effective-Rank Guided Adaptation for Weather-Generalized Depth Estimation
Weilong Yan, Xin Zhang, Robby T. Tan

TL;DR
This paper introduces ER-LoRA, a PEFT approach using effective-rank guided adaptation for weather-generalized depth estimation, leveraging pretrained vision models with minimal high-visibility data to outperform existing methods.
Contribution
The paper proposes a novel STM strategy for PEFT that decomposes pretrained weights based on effective ranks, enabling flexible adaptation and strong generalization in depth estimation under adverse weather.
Findings
STM outperforms existing PEFT and full fine-tuning methods.
The approach surpasses methods trained on synthetic adverse data.
Achieves state-of-the-art results across multiple weather conditions.
Abstract
Monocular depth estimation under adverse weather conditions (e.g., rain, fog, snow, and nighttime) remains highly challenging due to the lack of reliable ground truth and the difficulty of learning from unlabeled real-world data. Existing methods often rely on synthetic adverse data with pseudo-labels, which suffer from domain gaps, or employ self-supervised learning, which violates photometric assumptions in adverse scenarios. In this work, we propose to achieve weather-generalized depth estimation by Parameter-Efficient Fine-Tuning (PEFT) of Vision Foundation Models (VFMs), using only a small amount of high-visibility (normal) data. While PEFT has shown strong performance in semantic tasks such as segmentation, it remains underexplored for geometry-centric tasks like depth estimation, especially in terms of balancing effective adaptation with the preservation of pretrained…
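| Method | Backbone | Trainable Params* | nuScenes-day AbsRel | nuScenes-day SqRel | nuScenes-day RMSE | nuScenes-day Thr. acc. | nuScenes-night AbsRel | nuScenes-night SqRel | nuScenes-night RMSE | nuScenes-night Thr. acc. | nuScenes-rain AbsRel | nuScenes-rain SqRel | nuScenes-rain RMSE | nuScenes-rain Thr. acc. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|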
| Monodepth2 godard2019digging | ResNet | 11.7M | 13.33 | 1.820 | 6.459 | 85.88 | 24.19 | 2.776 | 10.922 | 58.17 | 15.72 | 2.273 | 7.453 | 79.49 |
| RNW wang2021RNW | ResNet | 11.7M | 28.72 | 3.433 | 9.185 | 56.21 | 33.33 | 4.066 | 10.098 | 43.72 | 29.52 | 3.796 | 9.341 | 57.21 |
| robust-depth Saunders_2023_ICCV | ResNet | 11.7M | 14.36 | 1.862 | 6.802 | 82.95 | 21.01 | 2.691 | 8.673 | 69.58 | 14.58 | 1.891 | 7.371 | 80.21 |
| md4all gasperini_morbitzer2023md4all | ResNet | 11.7M | 13.66 | 1.752 | 6.452 | 84.61 | 19.21 | 2.386 | 8.507 | 71.07 | 14.14 | 1.829 | 7.228 | 80.98 |
| DM-MDE diffusionadverse | ResNet | 11.7M | 12.80 | - | 6.449 | 84.03 | 19.10 | - | 8.433 | 71.14 | 13.90 | - | 7.129 | 81.36 |
| FFT | DINOv2 | 304.2M | 11.87 | 1.518 | 6.117 | 88.39 | 18.73 | 2.459 | 8.242 | 74.26 | 13.69 | 1.709 | 6.725 | 84.03 |
| FFT+md4all | DINOv2 | 304.2M | 11.55 | 1.412 | 6.060 | 88.54 | 18.69 | 2.593 | 8.391 | 73.99 | 12.80 | 1.675 | 6.639 | 84.98 |
| Freeze | DINOv2 | 0.0M | 12.00 | 1.502 | 6.086 | 87.49 | 17.22 | 1.883 | 8.051 | 72.19 | 13.37 | 1.668 | 6.580 | 84.27 |
| Rein Rein | DINOv2 | 5.0M | 12.09 | 1.475 | 6.167 | 87.55 | 17.23 | 1.896 | 7.711 | 73.52 | 13.66 | 1.707 | 6.673 | 84.48 |
| LoRA hu2022lora | DINOv2 | 7.0M | 11.40 | 1.247 | 5.875 | 88.32 | 17.57 | 1.978 | 7.730 | 74.40 | 13.26 | 1.572 | 6.524 | 84.61 |
| SoMA yun2024soma | DINOv2 | 5.3M | 11.34 | 1.422 | 6.018 | 88.91 | 17.88 | 1.960 | 7.732 | 74.53 | 13.05 | 1.705 | 6.545 | 84.77 |
| Ours | DINOv2 | 8.7M | 10.78 | 1.256 | 5.797 | 89.18 | 16.75 | 1.772 | 7.465 | 75.23 | 12.40 | 1.532 | 6.404 | 85.40 |
| Method | RC-day AbsRel | RC-day RMSE | RC-day Thr. acc. | RC-night AbsRel | RC-night RMSE | RC-night Thr. acc. | DS-rain AbsRel | DS-rain RMSE | DS-rain Thr. acc. | DS-fog AbsRel | DS-fog RMSE | DS-fog Thr. acc. | CADC-Snow AbsRel | CADC-Snow RMSE | CADC-Snow Thr. acc. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| robust-depth* Saunders_2023_ICCV | 12.25 | 3.302 | 85.84 | 13.33 | 3.756 | 85.11 | 26.22 | 11.657 | 59.58 | 13.66 | 6.876 | 84.01 | 31.47 | 13.65 | 43.37 |
| md4all* gasperini_morbitzer2023md4all | 11.28 | 3.206 | 87.13 | 12.19 | 3.604 | 84.86 | 18.22 | 8.465 | 70.35 | 12.42 | 6.269 | 86.16 | 29.65 | 12.85 | 48.59 |
| DM-MDE* diffusionadverse | 11.90 | 3.287 | 87.17 | 12.90 | 3.661 | 83.68 | - | - | - | - | - | - | - | - | - |
| FFT | 10.36 | 3.099 | 89.20 | 12.31 | 3.667 | 87.53 | 10.57 | 5.262 | 90.76 | 8.69 | 4.839 | 94.02 | 25.76 | 10.44 | 65.93 |
| FFT+md4all | 10.08 | 3.118 | 89.15 | 12.70 | 3.973 | 84.37 | 11.08 | 5.394 | 89.15 | 8.71 | 4.662 | 93.99 | 26.24 | 10.49 | 65.10 |
| Freeze | 9.54 | 2.906 | 90.03 | 11.74 | 3.687 | 85.64 | 10.63 | 5.476 | 90.91 | 8.50 | 4.740 | 93.92 | 27.23 | 10.37 | 61.98 |
| Rein Rein | 9.62 | 2.986 | 90.14 | 12.13 | 3.766 | 85.87 | 11.42 | 5.739 | 88.98 | 8.55 | 4.987 | 93.69 | 26.80 | 10.81 | 64.51 |
| LoRA hu2022lora | 9.52 | 3.018 | 90.31 | 11.53 | 3.458 | 88.07 | 11.17 | 5.545 | 90.20 | 8.89 | 4.811 | 93.86 | 26.07 | 10.42 | 65.11 |
| SoMA yun2024soma | 9.39 | 2.920 | 90.73 | 11.43 | 3.401 | 88.28 | 11.60 | 6.079 | 88.90 | 9.67 | 5.342 | 92.94 | 26.10 | 10.62 | 66.11 |
| Ours | 9.41 | 2.917 | 90.57 | 11.12 | 3.393 | 88.81 | 10.49 | 5.444 | 91.45 | 8.62 | 4.659 | 94.42 | 24.36 | 10.08 | 67.71 |
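| Method | NS-night AbsRel | NS-night RMSE | NS-night Thr. acc. | NS-rain AbsRel | NS-rain RMSE | NS-rain Thr. acc. | RC-night AbsRel | RC-night RMSE | RC-night Thr. acc. | DS-rain AbsRel | DS-rain RMSE | DS-rain Thr. acc. | DS-fog AbsRel | DS-fog RMSE | DS-fog Thr. acc. | CADC-snow AbsRel | CADC-snow RMSE | CADC-snow Thr. acc. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|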
| md4all gasperini_morbitzer2023md4all | 18.21 | 6.372 | 75.33 | 15.62 | 5.903 | 82.82 | - | - | - | - | - | - | - | - | - | - | - | - |
| FFT | 14.88 | 7.214 | 78.37 | 8.55 | 5.290 | 90.13 | 8.74 | 2.893 | 90.53 | 8.69 | 4.889 | 93.15 | 6.18 | 4.096 | 95.31 | 22.61 | 10.44 | 66.21 |
| Freeze | 17.57 | 8.241 | 73.20 | 9.60 | 5.520 | 87.91 | 8.99 | 2.908 | 89.71 | 9.30 | 5.008 | 92.07 | 6.76 | 4.353 | 94.38 | 21.90 | 9.973 | 68.13 |
| Rein Rein | 18.05 | 8.424 | 72.53 | 9.75 | 5.615 | 87.51 | 9.25 | 2.891 | 89.56 | 9.22 | 5.009 | 92.33 | 6.65 | 4.367 | 94.47 | 22.34 | 10.25 | 66.22 |
| LoRA hu2022lora | 15.38 | 7.379 | 77.74 | 8.66 | 5.341 | 89.89 | 8.87 | 2.914 | 90.33 | 8.07 | 4.459 | 94.32 | 5.99 | 3.855 | 95.97 | 21.97 | 10.07 | 66.33 |
| DA v2 depth_anything_v2 | 17.64 | 8.424 | 74.87 | 12.19 | 7.217 | 84.51 | 8.60 | 3.150 | 91.66 | 9.04 | 5.030 | 92.89 | 6.88 | 4.552 | 95.43 | 22.53 | 10.78 | 72.97 |
| Ours | 14.82 | 7.258 | 78.57 | 8.45 | 5.218 | 90.24 | 8.85 | 2.893 | 90.63 | 7.81 | 4.349 | 94.90 | 5.71 | 3.729 | 96.30 | 21.73 | 10.10 | 66.37 |
| Modules | nuScenes-night AbsRel | nuScenes-night SqRel | nuScenes-night RMSE | nuScenes-night Thr. acc. | nuScenes-rain AbsRel | nuScenes-rain SqRel | nuScenes-rain RMSE | nuScenes-rain Thr. acc. |
|---|---|---|---|---|---|---|---|---|
| ✓ | 17.21 | 2.087 | 7.693 | 74.76 | 12.93 | 1.633 | 6.485 | 84.86 |
| ✓ ✓ | 16.96 | 1.942 | 7.604 | 75.06 | 12.58 | 1.596 | 6.438 | 85.12 |
| ✓ ✓ ✓ | 16.75 | 1.772 | 7.465 | 75.23 | 12.40 | 1.532 | 6.404 | 85.40 |
ER-LoRA: Effective-Rank Guided Adaptation for Weather-Generalized Depth Estimation
Weilong Yan, Xin Zhang, Robby T. Tan
National University of Singapore
Abstract
Monocular depth estimation under adverse weather conditions (e.g., rain, fog, snow, and nighttime) remains highly challenging due to the lack of reliable ground truth and the difficulty of learning from unlabeled real-world data. Existing methods often rely on synthetic adverse data with pseudo-labels, which suffer from domain gaps, or employ self-supervised learning, which violates photometric assumptions in adverse scenarios. In this work, we propose to achieve weather-generalized depth estimation by Parameter-Efficient Fine-Tuning (PEFT) of Vision Foundation Models (VFMs), using only a small amount of high-visibility (normal) data. While PEFT has shown strong performance in semantic tasks such as segmentation, it remains underexplored for geometry-centric tasks like depth estimation—especially in terms of balancing effective adaptation with the preservation of pretrained knowledge. To this end, we introduce the Selecting-Tuning-Maintaining (STM) strategy, which structurally decomposes the pretrained weights of VFMs based on two kinds of effective ranks (entropy-rank and stable-rank). In the tuning phase, we adaptively select the proper rank number as well as the task-aware singular directions for initialization, based on the entropy-rank and full-tuned weight; while in the maintaining stage, we enforce a principal direction regularization based on the stable-rank. This design guarantees flexible task adaptation while preserving the strong generalization capability of the pretrained VFM. Extensive experiments on four real-world benchmarks across diverse weather conditions demonstrate that STM not only outperforms existing PEFT methods and full fine-tuning but also surpasses methods trained with adverse synthetic data, and even the depth foundation model, under both supervised and self-supervised settings.
1 Introduction
Monocular Depth Estimation (MDE) is a core task in computer vision, supporting a wide range of applications including robotics yu2023udepth , autonomous driving zheng2024physical ; deep-homo , medical tasks wang2025monopcc ; pvchat , reconstruction ssnerf , and scene understanding schon2021mgnet ; 3dswapping . Recent advances have largely centered on self-supervised MDE 2017left-right-consistency ; godard2019digging ; litemono ; zhao2022monovit ; bian2021ijcv_scdepth ; 2023planedepth ; 2021epcdepth ; manydepth , which avoid the need for ground-truth depth by enforcing photometric consistency across image pairs. While effective in favorable environments, such methods often fail under adverse conditions (e.g., nighttime, fog, or rain), where the photometric assumption breaks down. Another line of work focuses on building depth foundation models depth_anything_v2 ; depthanything ; marigold ; gui2024depthfm ; zoedepth ; birkl2023midas ; shao2023IEBins via supervised learning on large-scale datasets, aiming to enhance cross-domain generalization. However, their performance is limited by sparse and noisy annotations, which are common in outdoor adverse settings—where ground-truth depth (e.g., LiDAR) often covers only a fraction of the scene and suffers from artifacts like rain-induced reflections gasperini_morbitzer2023md4all ; nuscenes2019 . Consequently, the inability to extract reliable supervision from unlabeled adverse data, combined with the inefficiency of relying on imperfect annotations, renders robust monocular depth estimation (RMDE) under diverse real-world conditions an unsolved and practically important problem.
To improve robustness under adverse conditions, several RMDE methods wang2021RNW ; diffusionadverse ; gasperini_morbitzer2023md4all ; Saunders_2023_ICCV ; wang2023weatherdepth ; diffusion_contrast ; syn2real-depth adopt unsupervised domain adaptation (UDA) frameworks to avoid reliance on real-world depth labels. Early efforts vankadari2020ADFA ; wang2021RNW ; spencer2020defeat ; 23ICRA_Steps target a single adverse scenario, such as nighttime, but often fail to generalize beyond their specific conditions. Subsequent works liu2021ADIDS ; vankadari2023sun attempt to bridge daytime and nighttime domains, yet struggle to handle the broader diversity of real-world environments. More recent approaches gasperini_morbitzer2023md4all ; Saunders_2023_ICCV ; diffusionadverse ; wang2023weatherdepth ; syn2real-depth aim to improve generalization across diverse weather conditions by leveraging generative augmentation CycleGAN2017 ; zheng_2020_forkgan ; LDM ; controlnet or synthetic-to-real domain adaptation. However, their reliance on synthetic data and limited exposure to diverse target domains hinders generalization to unseen real-world conditions. These trends raise a critical question: Is synthetic data truly necessary to achieve robust depth estimation under adverse conditions?
Vision Foundation Models (VFMs) MAE ; clip ; EVA ; oquab2023dinov2 have emerged as a powerful paradigm, demonstrating strong generalization across tasks and domains through large-scale self-supervised pretraining on natural images. Recent works have explored parameter-efficient fine-tuning (PEFT) of VFMs for challenging scenarios such as domain-generalized semantic segmentation Rein ; yun2024soma ; bi2024fada ; mfuser and domain-generalized object detection yun2024soma ; chen2022adaptformer , achieving promising results. Motivated by this robustness and transferability, we explore whether VFMs can be efficiently adapted to RMDE, particularly under unseen adverse conditions. However, existing PEFT methods hu2022lora ; yun2024soma ; miLoRA ; task_specific_direction ; Rein ; chen2022adaptformer are mainly designed for semantic tasks, owing to the semantic nature of VFMs. Moreover, they overlook the full singular spectrum, do not protect critical directions, and underutilize the rich directional information in the frozen weight matrices. In contrast, RMDE relies on low-level features rather than semantic information, so directly applying previous approaches leads to sub-optimal results.
In this work, we present ER-LoRA, a novel PEFT pipeline for robust monocular depth estimation guided by effective rank. We begin by analyzing the frozen weight matrices of VFMs oquab2023dinov2 via singular value decomposition (SVD), revealing how singular vectors are distributed across singular values. We also examine how these weights change under full fine-tuning, and the structural similarity between the frozen weight and the residual weight. Based on these insights, we design a Selecting, Tuning, and Maintaining (STM) strategy that adapts the semantically biased representations of VFMs toward low-level geometric reasoning while retaining their inherent robustness and generalization. Guided by the definition of effective ranks (e-ranks) entropyrank ; stablerank3 ; stablerank1 ; stablerank2 , we adaptively select a proper rank for each weight matrix and identify the most task-relevant directions, which yields a better initialization. In the tuning phase, we train only the low-rank branch; in the maintaining phase, we apply a component regularization that minimally perturbs critical directions. Compared to prior PEFT methods, our approach provides more flexible task adaptation while preserving the strong generalization capability of the pretrained VFM, which helps when adapting strong VFMs to semantics-agnostic tasks.
We evaluate RMDE in both self-supervised and supervised settings, training only on daytime data and testing in a zero-shot manner on multiple real-world datasets nuscenes2019 ; robotcar ; drivingstereo ; cadc that cover diverse adverse and normal conditions (e.g., daytime, nighttime, rain, fog, snow). In both settings, our method surpasses previous synthetic-data-based methods, PEFT approaches, full fine-tuning (FFT), and the MDE foundation model. Our contributions are as follows:
- To our knowledge, we are the first to propose learning weather-generalized RMDE via Parameter-Efficient Fine-Tuning of VFMs. We design ER-LoRA, a novel PEFT pipeline for RMDE, which eliminates the need for the synthetic adverse data used in prior methods.
- We leverage the definition of effective ranks to propose a novel Selecting, Tuning, and Maintaining (STM) strategy for PEFT, which offers a flexible tuning space while retaining more of the pretrained generalization ability and robustness than existing methods.
- Experimental results under both self-supervised and supervised settings, across various real-world datasets of diverse conditions, demonstrate that our method outperforms existing synthetic-data-based methods, PEFT methods, full fine-tuning, and MDE foundation models, with an average AbsRel improvement of 7.3% over FFT and 6.4% over PEFT methods.
2 Related Work
Monocular Depth Estimation aims to predict a dense depth map from a single image, and has mainly evolved along two paradigms: supervised and self-supervised methods. Supervised methods have recently tried to build foundational MDE models either by leveraging strong backbones, such as convolutional neural networks Eigen2014 ; yuan2022newcrfs ; depth_resnet2016 , transformers li2022depthformer ; DPT2021 ; transdepth , and generative models marigold ; gui2024depthfm ; fu2024geowizard ; hu2024metric3d , or by enhancing and scaling up data birkl2023midas ; zoedepth ; depth_anything_v2 ; depthanything through mixing various datasets, from synthetic to real-world and from labeled to unlabeled. Some works also introduce language priors into MDE pipelines VPD ; zeng2024wordepthvariationallanguageprior ; ecodepth ; languageguidance , and DAR designs a depth autoregressive model based on the recent success of AR models.
In contrast, self-supervised methods need no ground truth for supervision and instead rely on the assumption of photometric consistency 2017left-right-consistency ; 2017cvpr_egomotion . 2017cvpr_egomotion first utilizes a photometric loss in self-supervised depth estimation, while subsequent works godard2019digging ; sc_depthv3 ; bian2021ijcv_scdepth design different strategies to handle moving objects and scale ambiguity. Some approaches zhao2022monovit ; litemono focus on lightweight architectures that efficiently combine attention and CNNs.
Robust Monocular Depth Estimation in adverse weather conditions is challenging due to the diverse types of degradation and the inefficiency of learning from real-world data. Some approaches are tailored to a specific condition vankadari2020ADFA ; wang2021RNW ; zhao2022ITDFA ; 23ICRA_Steps ; DCLdepth or to daytime-nighttime conditions liu2021ADIDS ; vankadari2023sun ; spencer2020defeat , either through discriminative learning, through generative models CycleGAN2017 for data augmentation, or through image enhancement for a given condition. Recent methods gasperini_morbitzer2023md4all ; Saunders_2023_ICCV ; diffusionadverse ; wang2023weatherdepth ; diffusion_contrast ; syn2real-depth address multi-condition RMDE by generating realistic adverse data via GANs or diffusion models CycleGAN2017 ; zheng_2020_forkgan ; LDM ; controlnet , or by designing a synthetic-to-real adaptation strategy. However, they depend on a costly data synthesis process and on the quality of the synthesized data, and they struggle to generalize to unseen conditions.
Parameter-efficient Fine-tuning (PEFT) of Vision Foundation Models (VFMs) fine-tunes a very small subset of parameters to adapt a large VFM to downstream tasks. Recent methods include: Low-Rank Adaptation hu2022lora ; yun2024soma ; miLoRA ; liu2024dora ; fan2025makeloragreatagain ; meng2024pissa , which injects low-rank tuning branches into the model with minimal disruption to its core representational structure; Adapter-tuning approaches chen2022adaptformer ; bi2024fada ; set ; Rein ; mfuser , which insert lightweight modules after each layer to refine the latent representations without modifying the original backbone; and Prompt-tuning methods lester2021powerscaleparameterefficientprompt ; jia2022visualprompttuning , which introduce learnable tokens at the input of selected attention layers, enabling task-specific conditioning without modifying model weights. Many other domain generalization or adaptation works domain_human_pose ; adaptive-domain-gen ; heap also adopt these techniques. However, previous PEFT methods for VFMs focus on semantic tasks, with limited exploration of low-level, geometry-centric applications. Given the semantic bias of VFMs, adapting them to tasks like RMDE requires weight tuning. To this end, we build on the low-rank adaptation paradigm and propose a novel PEFT strategy for robust depth estimation.
3 Preliminary
3.1 Self-supervised Learning in MDE
Self-supervised MDE assumes photometric consistency between consecutive frames of a monocular video 2017cvpr_egomotion ; 2017left-right-consistency , an assumption that usually holds in daytime conditions with good visibility. In our self-supervised pipeline, we follow the common practice of Monodepth2 godard2019digging . Given a target frame $I_t$ and a source frame $I_s$, a depth network $\Phi_D$ and a pose network $\Phi_P$ are trained simultaneously to minimize the photometric error:
$$\mathcal{L}_{p} = \mathrm{PE}(I_t, I_{s \to t}), \tag{1}$$
$$I_{s \to t} = I_s \big\langle \mathrm{proj}(D_t, T_{t \to s}, K) \big\rangle, \tag{2}$$
$$\mathrm{PE}(I_a, I_b) = \frac{\alpha}{2}\big(1 - \mathrm{SSIM}(I_a, I_b)\big) + (1 - \alpha)\,\lVert I_a - I_b \rVert_1, \tag{3}$$
where PE denotes the photometric error, $\alpha$ is a weighting factor, $K$ represents the known camera intrinsics, $\langle \cdot \rangle$ denotes the sampling operator, and $\mathrm{proj}(\cdot)$ is the projection operation. $I_{s \to t}$ means projecting pixels from $I_s$ to $I_t$, and $D_t$ and $T_{t \to s}$ represent the outputs of $\Phi_D$ and $\Phi_P$, respectively. In our method, we follow the previous design of $\Phi_P$, but explore PEFT of VFMs in $\Phi_D$.
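As a rough illustration of the photometric error used in Monodepth2-style training, the following NumPy sketch combines an SSIM term with an L1 term on two grayscale images. The helper names, the 3x3 box window, the SSIM constants, and the default weight `alpha=0.85` are assumptions taken from common practice and are not specified in this text; the warping step that produces the second image is omitted.

```python
import numpy as np

def _mean3x3(x):
    """3x3 box filter with reflection padding on an (H, W) array."""
    p = np.pad(x, 1, mode="reflect")
    h, w = x.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def ssim_map(a, b, c1=0.01 ** 2, c2=0.03 ** 2):
    """Per-pixel SSIM between two grayscale images with values in [0, 1]."""
    mu_a, mu_b = _mean3x3(a), _mean3x3(b)
    var_a = _mean3x3(a * a) - mu_a ** 2
    var_b = _mean3x3(b * b) - mu_b ** 2
    cov = _mean3x3(a * b) - mu_a * mu_b
    num = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    den = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return num / den

def photometric_error(target, warped, alpha=0.85):
    """Mean over pixels of alpha/2 * (1 - SSIM) + (1 - alpha) * |target - warped|.

    alpha=0.85 is an assumed default, not a value stated in the text.
    """
    ssim_term = np.clip((1.0 - ssim_map(target, warped)) / 2.0, 0.0, 1.0)
    l1_term = np.abs(target - warped)
    return float(np.mean(alpha * ssim_term + (1.0 - alpha) * l1_term))
```

The error is zero when the warped source frame reproduces the target exactly, which is the situation the photometric-consistency assumption describes and the one that breaks down at night or in rain.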
3.2 Low-rank Adaptation (LoRA)
LoRA hu2022lora assumes that the weight change induced by fine-tuning a pretrained weight matrix can be modeled with a low-rank structure. Let the pretrained weight matrix inside a VFM be denoted as $W_0 \in \mathbb{R}^{m \times n}$ and its update matrix as $\Delta W$. Then $\Delta W$ can be decomposed in a low-rank manner:
$$\Delta W = BA, \tag{4}$$
where $B \in \mathbb{R}^{m \times r}$, $A \in \mathbb{R}^{r \times n}$, and $r$ is the intrinsic rank with $r \ll \min(m, n)$. Note that $A$ is initialized with a uniform Kaiming distribution he2015delvingdeeprectifierssurpassing , and $B$ is initialized with zeros. This guarantees that $\Delta W$ is the zero matrix initially. The final weight matrix is defined as:
$$W = W_0 + \Delta W = W_0 + BA. \tag{5}$$
Notably, this design updates only a small number of parameters compared to full fine-tuning (FFT), while introducing negligible computational overhead during inference.
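A minimal NumPy sketch of the LoRA parameterization described above may help make the parameter saving concrete. The layer sizes, the rank, and the variable names are illustrative assumptions; the Kaiming-uniform bound uses the ReLU gain from He et al., which may differ from the constant used in a given LoRA implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 64, 48, 4                      # hypothetical layer sizes and rank

W0 = rng.standard_normal((m, n))         # frozen pretrained weight
bound = np.sqrt(6.0 / n)                 # Kaiming-uniform bound: gain sqrt(2), fan_in = n
A = rng.uniform(-bound, bound, (r, n))   # trainable, random init
B = np.zeros((m, r))                     # trainable, zero init

def merged_weight(W0, B, A):
    """Final weight: frozen matrix plus the low-rank update B @ A."""
    return W0 + B @ A

full_params = m * n                      # parameters updated by full fine-tuning
lora_params = r * (m + n)                # parameters updated by LoRA
```

Because `B` starts at zero, the merged weight equals the pretrained weight at the start of training, and for these sizes LoRA trains 448 parameters instead of 3072.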
3.3 Singular Value Decomposition (SVD) and Effective Ranks (e-ranks) of Matrices
The rank of a matrix equals the number of non-zero singular values in its Singular Value Decomposition. The Eckart–Young–Mirsky theorem Eckart_Young_1936 shows that summing the top singular components gives the best low-rank approximation of the matrix, so these components represent its core. However, in real-world scenarios, large matrices often have full rank, even though many of the components are dominated by noise or carry little useful information. We therefore examine the singular spectrum of weight matrices through the Effective Rank, a real-valued extension of rank.
Definition 1. Effective Rank from Entropy of Singular Values (entropy rank). Consider a matrix $W \in \mathbb{R}^{m \times n}$, which can be decomposed via SVD into
$$W = U \Sigma V^\top = \sum_{i=1}^{k} \sigma_i u_i v_i^\top, \quad k = \min(m, n), \tag{6}$$
where $u_i$, $\sigma_i$, $v_i$ correspond to the $i$-th left singular vector, singular value, and right singular vector. entropyrank models the singular values as a probability distribution $p_i = \sigma_i / \sum_{j} \sigma_j$. The entropy of this distribution is $H = -\sum_{i} p_i \log p_i$, and the entropy rank is defined as
$$\mathrm{erank}(W) = \exp(H). \tag{7}$$
Notably, as discussed in entropyrank , the entropy rank reflects the dispersion of information across singular directions: larger values correspond to more evenly distributed singular values.
Definition 2. Effective Rank from Numerical Perspective of Singular Values (stable rank). stablerank1 ; stablerank2 ; stablerank3 define the stable rank from the relative magnitudes of all singular values. Since we have $\lVert W \rVert_F^2 = \sum_{i} \sigma_i^2$ and $\lVert W \rVert_2 = \sigma_1$, the stable rank is defined as
$$\mathrm{srank}(W) = \frac{\lVert W \rVert_F^2}{\lVert W \rVert_2^2} = \frac{\sum_{i} \sigma_i^2}{\sigma_1^2}. \tag{8}$$
The stable rank reflects the degree of concentration of energy in the leading singular directions. A lower stable rank indicates that most energy is concentrated in a few directions.
Lemma 1. Stable rank is upper bounded by entropy rank. The stable rank is always less than or equal to the entropy rank, assuming that : . The detailed proof is provided in the Appendix.
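The two effective ranks can be computed directly from the singular values. The sketch below uses the standard definitions (entropy rank as the exponential of the Shannon entropy of the normalized singular values, stable rank as the squared Frobenius norm over the squared spectral norm); the function names and the small threshold `eps` used to skip numerically zero singular values are assumptions for illustration.

```python
import numpy as np

def entropy_rank(W, eps=1e-12):
    """exp of the Shannon entropy of singular values normalized to sum to 1."""
    s = np.linalg.svd(W, compute_uv=False)
    p = s / s.sum()
    p = p[p > eps]                       # drop numerically zero entries (0 * log 0 = 0)
    return float(np.exp(-(p * np.log(p)).sum()))

def stable_rank(W):
    """Squared Frobenius norm divided by squared spectral norm."""
    s = np.linalg.svd(W, compute_uv=False)   # sorted in descending order
    return float((s ** 2).sum() / s[0] ** 2)

# Usage on a random matrix: both values lie between 1 and min(m, n).
W = np.random.default_rng(0).standard_normal((32, 24))
er, sr = entropy_rank(W), stable_rank(W)
```

For an identity matrix both quantities equal the ordinary rank, for a rank-one matrix both equal one, and on random matrices the stable rank stays below the entropy rank, in line with Lemma 1.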
4 Methodology
Fig. 2 gives an overview of our proposed method. We start by analyzing the weight matrices inside VFMs via SVD through different layers in Fig. 1, clearly studying how different weight matrices with different singular value distributions have been changed after full fine-tuning (FFT) in Sec. 4.1. With the observations, we propose the Selecting, Tuning, Maintaining (STM) Strategy for RMDE via PEFT of VFMs within each linear layer in Sec. 4.2, and our training pipeline in Sec. 4.3.
4.1 Weight Decomposition and Analysis inside VFMs
Inspired by yun2024soma , we first decompose the weight matrices inside a frozen VFM (e.g., DINOv2 oquab2023dinov2 ) as in Eq. 6. Fig. 1(a) plots the resulting singular value spectra for several layers. Deeper layers exhibit a more uniform singular value distribution than shallower ones, as reflected by the slower decay of their spectra. This indicates that in shallow layers a few components stably provide most of the information, so it is better not to disturb this structure with large-rank adaptation, which helps retain pretrained robustness.
SoMA yun2024soma directly freezes the shallow layers, and uniformly samples the last singular components from each weight matrix for LoRA initialization. Considering that RMDE is not strongly correlated with the semantic understanding inside VFMs, we go a step further and examine how weight matrices change from the pretrained weight $W_0$ to the fully fine-tuned weight $W_{ft}$ after full fine-tuning (FFT) on a small amount of daytime data. Here, the residual weight is defined as $\Delta W = W_{ft} - W_0$. Similarly, we plot the singular values of $\Delta W$ in Fig. 1(a). As with $W_0$, the spectra in shallower layers exhibit a much sharper decay. Moreover, to find how the magnitude of each direction in Eq. 6 changes, we project $\Delta W$ onto the singular directions of $W_0$ via $U^\top \Delta W V$, as visualized in Fig. 1(a): the projection amplitudes in shallow layers decay rapidly and remain smooth, suggesting minimal disruption to their singular structure. In contrast, deeper layers exhibit fluctuations across a broader range of directions, indicating that tuning perturbs a wide range of singular components. These observations lead to two important insights: (1) Tuning tends to induce higher-rank transformations in deeper weights, while structural changes in shallower layers remain relatively minor; (2) In shallow layers, the perturbations are mostly concentrated in the top singular components, whereas in deeper layers, the changes are spread more uniformly across the entire singular spectrum.
Inspired by the above, we quantify how much different layers have changed by computing the entropy rank in Eq. 7 and the stable rank in Eq. 8, and plot them across all layers in Fig. 1(b). Both the entropy and stable ranks of the pretrained weight $W_0$ increase with layer depth, indicating greater structural complexity in deeper layers. In contrast, the residual weight $\Delta W$ keeps a lower rank, though its rank also grows across layers, suggesting broader but still sparse perturbations. These trends motivate our Selecting-Tuning-Maintaining design: (1) Selecting the proper budget and task-aware directions for low-rank tuning; (2) Tuning only the low-rank branch to adapt to target tasks; (3) Maintaining the directions in which energy is concentrated, as identified by the stable rank.
4.2 Selecting-Tuning-Maintaining (STM) Strategy
As shown in Fig. 2, the STM strategy starts by selecting proper objectives for initializing the low-rank adaptation. A good initialization should settle two things: how much to tune (ranks) and in which way to tune (directions). Let the pretrained weight, full-tuned weight, and full-tuned residual weight be denoted as $W_0$, $W_{ft}$, and $\Delta W = W_{ft} - W_0$, respectively. The low-rank tuning matrices are defined as $B \in \mathbb{R}^{m \times r}$, $A \in \mathbb{R}^{r \times n}$. Then, based on the observations in Sec. 4.1, we set a simple but meaningful linear correlation between the rank $r$ and the entropy rank of $W_0$:
[TABLE]
where serves as a scaling factor to control the magnitude of low-rank adaptation. This design allows flexible tuning capacity based on the singular spectrum of each pretrained weight matrix, distinguishing our approach from prior methods. Apart from the choice of ranks, we also explore which singular components are chosen for initialization. To achieve that, we extract the absolute projection value from the full-tuned residual weight on the directions of the pretrained weight:
[TABLE]
where contains the absolute diagonal values of the projection matrix. Different from task_specific_direction that utilizes relative change from projection for semantic-correlated tasks, we notice that in RMDE the relative changes in the least significant directions tend to be disproportionately large—primarily due to their near-zero base magnitudes—resulting in unstable and uninformative direction selection. Thus, we obtain directions with top values in as task-aware directions:
[TABLE]
where denotes the indices of the top-r singular directions with the highest projection magnitudes. Having the selected rank and direction list , low-rank tuning parameters can be initialized by
[TABLE]
and the pretrained weight at the start of tuning becomes
[TABLE]
denotes selecting the components from the indices . With the proper selection and initialization, the low-rank tuning can be more efficient in training to adapt to the RMDE task.
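The selecting step can be sketched in NumPy as follows. This is only an illustration under stated assumptions, since the exact equations are not recoverable from this text: the rank is taken as `gamma` times the entropy rank (rounded and clipped), the task-aware directions are those with the largest absolute projection of the residual onto the pretrained singular directions, and each selected singular value is split evenly between the two low-rank factors. The function name `select_and_init`, the parameter `gamma`, and its default value are hypothetical.

```python
import numpy as np

def select_and_init(W0, W_ft, gamma=0.25):
    """Pick a rank and task-aware directions, then build B, A and the frozen residual."""
    U, s, Vt = np.linalg.svd(W0, full_matrices=False)

    # Rank budget: assumed to scale linearly with the entropy rank of W0.
    p = s / s.sum()
    p = p[p > 1e-12]
    erank = np.exp(-(p * np.log(p)).sum())
    r = int(np.clip(np.rint(gamma * erank), 1, len(s)))

    # Direction scores: |u_i^T (W_ft - W0) v_i| for every singular direction i.
    dW = W_ft - W0
    proj = np.abs(np.diag(U.T @ dW @ Vt.T))
    idx = np.argsort(proj)[::-1][:r]     # indices of the r largest projections

    # Initialization (assumed): split sqrt(sigma_i) between the two factors.
    root = np.sqrt(s[idx])
    B = U[:, idx] * root                 # shape (m, r)
    A = root[:, None] * Vt[idx, :]       # shape (r, n)
    W_res = W0 - B @ A                   # frozen part, so W_res + B @ A == W0
    return r, idx, B, A, W_res
```

With this construction the model output is unchanged at the start of tuning, because the frozen residual plus the low-rank branch reproduces the pretrained weight exactly; only the selected directions are then free to move.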
Meanwhile, it is also crucial to preserve the generalization and robustness of the pretrained VFM during tuning. Given that the stable rank objectively reflects the concentration of energy within a weight matrix, we leverage it to identify the leading singular components to be preserved. These directions serve as structural anchors, preventing the adapted model from drifting away from the pretrained prior. This inspires us to design a component-preservation regularization for maintaining:
[TABLE]
where represents the number of layers. This regularization guarantees to protect the pretrained components, while not preventing the adaptation of those task-aware directions. Overall, the STM strategy balances task-specific learning and generalization preservation through entropy-rank and stable-rank-guided low-rank adaptation.
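One way to picture the maintaining step is a penalty on how far the adapted weight drifts within the subspace spanned by the leading singular directions of the pretrained weight, with the number of protected directions set by the stable rank. The sketch below is a hypothetical form for a single layer, not the paper's regularizer, whose equation is not recoverable here; the function name and the choice of a squared Frobenius penalty are assumptions.

```python
import numpy as np

def principal_direction_penalty(W0, W_now):
    """Squared drift of W_now from W0 inside the top-k singular subspace of W0.

    k is the stable rank of W0 rounded up (an assumed rule for illustration).
    """
    U, s, Vt = np.linalg.svd(W0, full_matrices=False)
    k = int(np.ceil((s ** 2).sum() / s[0] ** 2))
    k = min(k, len(s))
    drift = U[:, :k].T @ (W_now - W0) @ Vt[:k, :].T
    return float((drift ** 2).sum())
```

Under this form, a change along a leading direction is penalized while a change confined to trailing directions costs almost nothing, which matches the stated goal of anchoring principal components without blocking adaptation elsewhere. A multi-layer loss would sum this quantity over layers.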
4.3 Training Pipelines
Our training pipeline is designed for both self-supervised learning and supervised learning in RMDE. In self-supervised RMDE, we follow the common practice in previous approaches gasperini_morbitzer2023md4all ; godard2019digging to design the objectives with Eq. 1 and a smooth loss as follows:
[TABLE]
Considering the extreme sparsity of LiDAR data in the real-world dataset nuscenes2019 ; robotcar , which is also utilized and mentioned by md4all gasperini_morbitzer2023md4all , we add auxiliary pseudo labels generated by DepthAnything V2 depth_anything_v2 , to provide a dense supervisory signal. The loss with LiDAR data is chosen as an absolute relative loss similar to gasperini_morbitzer2023md4all , and the dense supervision from pseudo labels is chosen as a normalized L1 loss. Our supervised RMDE objective is designed as
[TABLE]
where the weights are balanced to rely more on ground-truth data. Note that all training uses only a small amount of daytime data and the model must generalize to unseen adverse conditions, unlike depth_anything_v2 ; depthanything , which learn from a huge amount of data with depth labels or pseudo labels.
5 Experiments
5.1 Implementation Details
In our experiments, by default, we use DINOv2-L oquab2023dinov2 as the frozen VFM, and DPT DPT2021 as the decoder. In the training, the learning rates for the backbone, decoder, and low-rank tuning are set to 5e-6, 2e-5, and 2e-5. All experiments are trained for epochs, except for full fine-tuning under supervised learning, we follow md4all gasperini_morbitzer2023md4all to report results using the last checkpoint before the ripple artifacts caused by sparse ground truth supervision become evident.
5.2 Dataset and Evaluation Protocol
We evaluate our method on four real-world outdoor datasets under diverse weather and illumination conditions. Specifically, NuScenes (NS) nuscenes2019 provides 15,129 daytime images for training and 6,019 testing images (including 4,449 from daytime, 1,088 from rain, and 602 from nighttime). RobotCar (RC) robotcar contains 17,790 daytime training images and 1,411 test images (702 daytime, 709 nighttime). DrivingStereo (DS) drivingstereo offers 500 rainy and 500 foggy images for zero-shot evaluation. CADC cadc contains 510 real snow images for zero-shot testing. All images are input to the models with a resolution of .
Our goal is to enable vision foundation models (VFMs) to generalize from normal daytime data to arbitrary unseen adverse conditions, without the need for synthetic adverse data. To this end, we train our model only on daytime images and evaluate it on all unseen domains. For self-supervised experiments, we follow prior practice gasperini_morbitzer2023md4all ; diffusionadverse and incorporate a weak velocity loss to stabilize monocular training. The evaluation protocol follows MiDaS birkl2023midas , aligning all predictions to the ground truth using scale and shift before computing depth errors. We use an evaluation range of for all datasets except RobotCar, which uses following prior convention.
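The scale-and-shift alignment in this protocol can be sketched as a closed-form least-squares fit. This is an illustrative sketch under assumptions, not the evaluation code used in the paper: the function names and the `min_depth` and `max_depth` arguments are hypothetical placeholders for the evaluation range, and the sketch is agnostic to whether the fit is carried out in depth or inverse-depth space.

```python
import numpy as np

def valid_mask(gt, min_depth, max_depth):
    """Pixels with ground truth inside the evaluation range (bounds supplied by the caller)."""
    return (gt > min_depth) & (gt < max_depth)

def align_scale_shift(pred, gt, valid):
    """Find s, t minimizing ||s * pred + t - gt||^2 on valid pixels, then apply them."""
    x = pred[valid].astype(np.float64)
    y = gt[valid].astype(np.float64)
    design = np.stack([x, np.ones_like(x)], axis=1)  # columns: prediction, constant
    (scale, shift), *_ = np.linalg.lstsq(design, y, rcond=None)
    return scale * pred + shift, float(scale), float(shift)
```

Fitting only on valid pixels keeps missing ground truth from biasing the estimate, and the aligned prediction is then compared with the ground truth on the same mask.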
In the self-supervised setting, we compare with four categories of baselines: (1) existing self-supervised robust depth estimation methods gasperini_morbitzer2023md4all ; Li_2023_ICCV ; wang2021RNW ; diffusionadverse ; (2) PEFT methods Rein ; hu2022lora ; yun2024soma under self-supervised monocular depth frameworks; (3) full fine-tuning and freezing of VFMs on daytime data only; and (4) full fine-tuning of VFMs using synthetic data from gasperini_morbitzer2023md4all . In the supervised setting, we compare against: (1) the supervised RMDE approach md4all gasperini_morbitzer2023md4all ; (2) PEFT-based supervised tuning methods; (3) full fine-tuning of VFMs on labeled data; and (4) a state-of-the-art MDE model, Depth Anything V2 depth_anything_v2 . All methods are evaluated under the same protocol for fair comparison, with the commonly used metrics: absolute relative error (AbsRel), squared relative error (SqRel), root mean square error (RMSE), and threshold accuracy.
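For reference, the four metrics can be computed as in the sketch below, which assumes their standard definitions in the monocular depth literature. The threshold value of 1.25 is the common convention and is an assumption here, since this section does not state it; the function name `depth_metrics` is likewise illustrative.

```python
import numpy as np

def depth_metrics(pred, gt, threshold=1.25):
    """Standard depth errors over valid pixels (inputs already masked and aligned)."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    diff = pred - gt
    ratio = np.maximum(pred / gt, gt / pred)  # symmetric per-pixel ratio, always >= 1
    return {
        "abs_rel": float(np.mean(np.abs(diff) / gt)),
        "sq_rel": float(np.mean(diff ** 2 / gt)),
        "rmse": float(np.sqrt(np.mean(diff ** 2))),
        "delta": float(np.mean(ratio < threshold)),  # fraction of pixels within threshold
    }
```

The three error metrics are lower-is-better, while the threshold accuracy is higher-is-better.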
5.3 Experimental Results
Tab. 1 reports quantitative results under the self-supervised setting on the nuScenes nuscenes2019 dataset. Our method achieves the best overall performance across unseen adverse domains (nighttime and rain). Compared to prior PEFT methods, our approach reduces the average AbsRel by 5.5% and RMSE by 3.1%, while using a similar number of trainable parameters. Against synthetic-data-based robust depth baselines, we achieve significantly better robustness. Note that full fine-tuning leads to noticeable degradation of the robust priors encoded in the pretrained VFM. Moreover, introducing synthetic adverse data tends to interfere with the knowledge learned from large-scale natural image distributions, ultimately harming the VFM’s generalization.
Tab. 2 presents zero-shot results on five unseen domains, learning from only nuScenes-daytime. Our method achieves the best overall generalization, outperforming all prior methods across average AbsRel, RMSE, and threshold accuracy. Compared to the latest PEFT method, SoMA yun2024soma , our approach improves average AbsRel and RMSE by 5.1% and 5.7%, respectively, and surpasses full fine-tuning (FFT) by 4.4% and 2.1%. These results validate the robustness of our STM strategy under diverse distribution shifts.
As shown in Tab. 3, our method, trained on only 15K daytime images, consistently outperforms all baselines, including PEFT, FFT, and Depth Anything V2 depth_anything_v2 . It achieves an average improvement of 7.4% in AbsRel and 6.1% in RMSE, demonstrating superior generalization under limited supervision.
5.4 Ablations
To validate the effectiveness of each proposed component, we conduct an ablation study with controlled variants. Specifically, we ablate three components: dynamic assignment of rank numbers per weight matrix, selection of task-aware directions for initialization, and regularization on important singular directions. The results are summarized in Tab. 4.
Overall, these modules are designed to jointly enhance generalization and stability: dynamic rank assignment adapts model capacity to each weight matrix’s structure, task-aware initialization ensures alignment between initialization and downstream tasks, and the regularization preserves essential representational subspaces during adaptation.
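To make the idea of adapting capacity to each weight matrix's structure concrete, the sketch below computes an effective rank from the singular value spectrum. It assumes the common entropy-based definition (the exponential of the Shannon entropy of the normalized singular values); the exact definition and the rule that maps it to a per-matrix rank are given elsewhere in the paper and are not reproduced here, so the function name and the definition are assumptions.

```python
import numpy as np

def effective_rank(weight, eps=1e-12):
    """Entropy-based effective rank of a 2-D weight matrix (assumed definition)."""
    singular_values = np.linalg.svd(weight, compute_uv=False)
    p = singular_values / (singular_values.sum() + eps)  # normalize the spectrum
    p = p[p > 0]                                          # skip exact zeros so log is defined
    entropy = -np.sum(p * np.log(p))
    return float(np.exp(entropy))
```

A matrix whose energy is spread evenly over its singular directions has an effective rank close to its full rank, whereas a matrix dominated by a single direction has an effective rank close to one, which is what allows a different rank budget per weight matrix.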
6 Conclusion
In this work, we propose ER-LoRA, a novel parameter-efficient fine-tuning (PEFT) pipeline for robust monocular depth estimation under diverse weather conditions. By analyzing the singular value spectrum of vision foundation models (VFMs), we introduce a Selecting–Tuning–Maintaining (STM) strategy that leverages effective rank to guide task-relevant initialization, discriminative tuning, and structural regularization. This design enables ER-LoRA to adapt to unseen domains without relying on synthetic data, while preserving the strong generalization ability of pretrained VFMs. Experiments on diverse real-world datasets under varying conditions validate the effectiveness of our method in both self-supervised and supervised settings, outperforming a wide range of existing approaches across different methodological branches.
References
- [1] Yihao Ai, Yifei Qi, Bo Wang, Yu Cheng, Xinchao Wang, and Robby T. Tan. Domain-adaptive 2D human pose estimation via dual teachers in extremely low-light conditions. In European Conference on Computer Vision, pages 221–239. Springer, 2024.
- [2] Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. ZoeDepth: Zero-shot transfer by combining relative and metric depth, 2023.
- [3] Qi Bi, Jingjun Yi, Hao Zheng, Haolan Zhan, Yawen Huang, Wei Ji, Yuexiang Li, and Yefeng Zheng. Learning frequency-adapted vision foundation model for domain generalized semantic segmentation. In Advances in Neural Information Processing Systems (NeurIPS), volume 37, pages 94047–94072, 2024.
- [4] Jia-Wang Bian, Huangying Zhan, Naiyan Wang, Zhichao Li, Le Zhang, Chunhua Shen, Ming-Ming Cheng, and Ian Reid. Unsupervised scale-consistent depth learning from video. International Journal of Computer Vision (IJCV), 2021.
- [5] Reiner Birkl, Diana Wofk, and Matthias Müller. MiDaS v3.1 – a model zoo for robust monocular relative depth estimation. arXiv preprint arXiv:2307.14460, 2023.
- [6] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. arXiv preprint arXiv:1903.11027, 2019.
- [7] Xiao Cao, Beibei Lin, Bo Wang, Zhiyong Huang, and Robby T. Tan. SSNeRF: Sparse view semi-supervised neural radiance fields with augmentation. arXiv preprint arXiv:2408.09144, 2024.
- [8] Xiao Cao, Beibei Lin, Bo Wang, Zhiyong Huang, and Robby T. Tan. 3DSwapping: Texture swapping for 3D object from single reference image. arXiv preprint arXiv:2503.18853, 2025.
