No Pose Estimation? No Problem: Pose-Agnostic and Instance-Aware Test-Time Adaptation for Monocular Depth Estimation
Mingyu Sung, Hyeonmin Choe, Il-Min Kim, Sangseok Yun, and Jae Mo Kang

TL;DR
This paper introduces PITTA, a novel test-time adaptation framework for monocular depth estimation that is pose-agnostic and instance-aware, significantly improving performance in diverse environments without requiring camera pose data.
Contribution
The paper presents a new TTA framework for MDE that does not rely on camera pose and uses instance-aware masking to handle dynamic objects, outperforming existing methods.
Findings
PITTA surpasses state-of-the-art methods on DrivingStereo and Waymo datasets.
Effective in diverse and dynamic environments without pose information.
Improves monocular depth estimation accuracy during test-time adaptation.
Abstract
Monocular depth estimation (MDE), inferring pixel-level depths in single RGB images from a monocular camera, plays a crucial role in a variety of AI applications demanding a three-dimensional (3D) topographical scene. In real-world scenarios, MDE models often need to be deployed in environments whose conditions differ from those seen in training. Test-time (domain) adaptation (TTA) is one of the compelling and practical approaches to address this issue. Although there have been notable advancements in TTA for MDE, particularly in a self-supervised manner, existing methods remain ineffective when applied to diverse and dynamic environments. To overcome this challenge, we propose a novel and high-performing TTA framework for MDE, named PITTA. Our approach incorporates two key innovative strategies: (i) pose-agnostic TTA paradigm for MDE and (ii)…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. Clear motivation: The paper rightly points out and illustrates that dependence on unreliable pose networks is one of the main weaknesses of current TTA techniques.
2. Novel compensation: The proposed self-supervised losses (depth-refining and edge-guided) essentially replace temporal consistency with semantic consistency.
3. Extensive & SOTA experiments: The approach is state-of-the-art on two datasets (DrivingStereo, Waymo) and is demonstrated to be a general, plug-and-play metho
- Dependency swap: The approach does not eliminate dependencies; it only replaces the dependence on a pose network with a dependence on an equally complicated panoptic segmentation network.
- Unverified core assumption: The approach presumes that the frozen segmentation network is resilient to the same domain changes (e.g., fog, rain) that the MDE model encounters. This is an important assumption that is neither tested nor discussed.
- Lacking computational analysis: The computational analysis (e.g., FPS) is
- Clear motivation for making TTA independent of pose estimation, which is often brittle under domain shift.
- Simple and modular: BN-only updates and plug-in components (instance masks and edge guidance) make it easy to integrate with standard MDE backbones.
In our own earlier evaluation, DepthAnything v2 on DrivingStereo fog/rain yields **δ<1.25 of 96.3/94.9**, whereas the paper reports **73.6/65.7**. This large gap raises concerns about experimental validity and is our major concern.

The approach critically depends on a *frozen panoptic segmenter* to create instance masks that directly modulate supervision. In adverse conditions (fog/rain/night/unusual classes), segmentation quality may degrade, potentially harming adaptation. The paper lacks a
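For reference, the δ<1.25 figure disputed in the review above is the standard threshold accuracy for depth estimation: the percentage of valid pixels whose ratio max(pred/gt, gt/pred) falls below 1.25. The sketch below is a minimal pure-Python illustration of that definition. The function name `delta_accuracy` and the toy values are hypothetical, and it is not the evaluation code used by the paper or by the reviewer.

```python
def delta_accuracy(pred, gt, threshold=1.25):
    """Percentage of valid pixels with max(pred/gt, gt/pred) < threshold.

    Assumptions (illustrative only): depths are flat lists of floats,
    predictions are strictly positive, and gt <= 0 marks a pixel with
    no ground truth, which is skipped.
    """
    hits = 0
    valid = 0
    for p, g in zip(pred, gt):
        if g <= 0:  # no ground truth at this pixel
            continue
        valid += 1
        if max(p / g, g / p) < threshold:
            hits += 1
    return 100.0 * hits / valid if valid else float("nan")


# Toy example: three valid pixels, two of them within the 1.25 ratio.
pred = [10.0, 20.0, 33.0, 5.0]
gt = [10.5, 30.0, 30.0, 0.0]
score = delta_accuracy(pred, gt)
```

Because the metric depends on which pixels count as valid and on any depth cap or scale alignment applied beforehand, differences in evaluation protocol are one possible source of gaps like the one the reviewer reports.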
1. This method is simple and broadly applicable to most depth estimation methods.
2. Extensive evaluations on diverse benchmark datasets verify that the method achieves state-of-the-art performance.
## Major Weaknesses

**1. Omission of essential information:**

- The authors did not provide qualitative comparisons between the proposed method and prior works. This evaluation is crucial, especially to demonstrate the advantages and distinctions introduced by the edge extraction component.
- In the Overview of Architecture section, the authors allocated an excessive portion of the content to explaining related work rather than describing the proposed method.
- The implementation details o
- The paper is well-written and easy to follow.
- The paper provides an in-depth analysis of the failure cases of existing works, with a detailed explanation in the method section.
Method:
- The method aims to improve the depth of dynamic objects via median filtering. The improved depth is then used as pseudo ground-truth depth to supervise the depth network. I have several concerns regarding this. First, and most importantly, I do not see the intuition for why median filtering improves the depth of dynamic objects and can therefore be used as a learning target. Please clarify this. Second, it is unclear how the method classifies between static and dynamic object
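To make the mechanism questioned above concrete, here is a minimal sketch of one possible reading of instance-masked median filtering: each pixel inside an instance mask is replaced by the median of its in-mask neighbours, and the result serves as a pseudo-depth target. This is an assumption based only on the reviewer's description, not the authors' implementation. The function name `masked_median_filter`, the 3×3 window, and the toy values are hypothetical.

```python
from statistics import median


def masked_median_filter(depth, mask, k=3):
    """Median-filter `depth` inside `mask`, using only in-mask neighbours.

    Hypothetical sketch: `depth` and `mask` are equally sized 2D lists,
    `k` is an odd window size, and pixels outside the mask are untouched.
    """
    h, w = len(depth), len(depth[0])
    r = k // 2
    out = [row[:] for row in depth]  # copy so the input stays unchanged
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            window = [
                depth[j][i]
                for j in range(max(0, y - r), min(h, y + r + 1))
                for i in range(max(0, x - r), min(w, x + r + 1))
                if mask[j][i]
            ]
            out[y][x] = median(window)
    return out


# Toy instance: an object at about 12 m with one outlier pixel (40 m),
# surrounded by background at 50 m.
depth = [
    [50.0, 50.0, 50.0, 50.0],
    [50.0, 12.0, 12.0, 50.0],
    [50.0, 12.0, 40.0, 50.0],
    [50.0, 12.0, 12.0, 50.0],
]
mask = [
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
    [0, 1, 1, 0],
]
pseudo = masked_median_filter(depth, mask)
```

In this toy case the outlier is pulled back to the object's dominant depth while the background is left alone. The sketch only shows the mechanics; it does not settle the reviewer's question of whether such a smoothed map is a trustworthy learning target when most of the object's predicted depth is wrong.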
Taxonomy
Topics: Advanced Vision and Imaging · Robot Manipulation and Learning · Human Pose and Action Recognition
