SenseShift6D: Multimodal RGB-D Benchmarking for Robust 6D Pose Estimation across Environment and Sensor Variations
Yegyu Han, Taegyoon Yoon, Dayeon Woo, Sojeong Kim, Hyung-Sin Kim

TL;DR
SenseShift6D introduces a comprehensive RGB-D dataset capturing real-world environmental and sensor variations, revealing significant robustness challenges in current 6D pose estimation methods and emphasizing the need for sensor-aware solutions.
Contribution
This paper presents the first large-scale RGB-D dataset with diverse environmental and sensor variations, and demonstrates the sensitivity of existing pose estimators to these factors, highlighting new challenges and opportunities.
Findings
State-of-the-art estimators show performance drops across lighting and sensor changes.
Sensor- and environment-aware robustness is crucial for real-world 6D pose estimation.
Test-time multimodal sensor selection can significantly improve pose estimation accuracy.
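To make the last finding concrete, here is a minimal sketch of oracle test-time sensor selection: for each frame, pick the sensor configuration whose pose score is highest, then compare the average against a fixed auto-exposure baseline. This is an illustration under stated assumptions, not the authors' implementation. The function names, configuration names, and score values are all hypothetical, and the oracle assumes ground-truth scores are available, so it gives only an upper bound on what a practical selector could reach.

```python
from typing import Dict, List


def oracle_selection(scores: Dict[str, List[float]]) -> List[str]:
    """Return, for each frame, the configuration with the highest score.

    `scores` maps a (hypothetical) sensor-configuration name to a list of
    per-frame pose-accuracy scores, where higher is better.
    """
    configs = list(scores)
    if not configs:
        raise ValueError("scores must contain at least one configuration")
    n_frames = len(scores[configs[0]])
    if any(len(values) != n_frames for values in scores.values()):
        raise ValueError("every configuration needs one score per frame")
    # max() keeps the first configuration on ties, so results are deterministic.
    return [max(configs, key=lambda c: scores[c][i]) for i in range(n_frames)]


def mean_selected_score(scores: Dict[str, List[float]], chosen: List[str]) -> float:
    """Average the scores obtained when frame i uses configuration chosen[i]."""
    return sum(scores[c][i] for i, c in enumerate(chosen)) / len(chosen)


# Hypothetical per-frame scores for three made-up configurations.
example_scores = {
    "auto_exposure": [0.60, 0.40, 0.70],
    "short_exposure_high_gain": [0.50, 0.80, 0.65],
    "long_exposure_low_gain": [0.75, 0.30, 0.72],
}

chosen = oracle_selection(example_scores)
oracle_mean = mean_selected_score(example_scores, chosen)
baseline_mean = mean_selected_score(
    example_scores, ["auto_exposure"] * len(chosen)
)
gap = oracle_mean - baseline_mean  # headroom available to sensor selection
```

The gap between `oracle_mean` and `baseline_mean` is the headroom a sensor-aware method could try to recover. A deployable selector would have to predict the best configuration without access to ground-truth scores.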
Abstract
Recent advances in 6D object pose estimation have achieved high performance on representative benchmarks such as LM-O, YCB-V, and T-LESS. However, these datasets were captured under fixed illumination and camera settings, leaving the impact of real-world variations in illumination, exposure, gain, or depth-sensor mode largely unexplored. To bridge this gap, we introduce SenseShift6D, the first RGB-D dataset that physically sweeps 13 RGB exposures, 9 RGB gains, auto-exposure, 4 depth-capture modes, and 5 illumination levels. For six common household objects, we acquire 198.8k RGB and 20.0k depth images (i.e., 795.4k RGB-D scenes), providing 1,380 unique sensor-lighting permutations per object pose. Experiments with state-of-the-art pretrained, generalizable pose estimators reveal substantial performance variation across lighting and sensor settings, despite their large-scale pretraining.…
Peer Reviews
Decision: ICLR 2026 Conference Desk Rejected Submission
- The idea of incorporating several camera parameter variations into object pose dataset collection is nice.
- This benchmark explores RGB-D sensor parameter variation (especially photometric parameters) for 6D pose estimation. This is an interesting and important aspect of the object pose estimation problem.
- The benchmark captures real camera effects under different parameter variations. These effects are not easily reproduced in synthetic data, which makes the benchmark valuable for studying the problem.
- The
- The dataset is not marker-less. There are many AR tags on the board beneath the object. This is acceptable, since other datasets such as LineMOD did the same, but the dataset is still far from "data in the wild".
- The dataset contains only 5 objects, which is a small number compared to other related datasets.
- It contains only single-object tabletop scenes, without occlusion or multi-object scenes.
- Not sure if all RGB and depth sensors can control their modes to allow varying
1. The paper is well written and structured.
2. The paper is well-motivated and contributes meaningfully to the 6D pose estimation research.
3. The extensive evaluation of existing algorithms on the benchmark presents many valuable insights.
1. The collected dataset has the ChArUco board visible in the images, as do other well-known datasets. The question is whether the ChArUco board could introduce bias into pose estimation algorithms that are trained on a dataset including the board. See more detailed discussion in [1]. I suggest removing the calibration board after calibration is done, as in the YCB dataset.
2. The number of objects and the background of the objects are limited. Th
- Introduction of an extensive 6D pose dataset that incorporates diverse illumination, exposure, gain, and depth settings. This enables benchmarking the robustness of pose estimation models in challenging real-world conditions.
- Comprehensive evaluation on different sensor configurations with pretrained models as well as instance-level pose estimation models.
- The AUC with optimal sensor control is at least +15 points higher compared to the baseline auto-exposure sensor configurations, indicating t
- There is no validation split, only a training and a test split, which might lead to overfitting of the evaluated models.
- The motivation is sensor-aware test-time adaptation, but the paper shows only the oracle upper bound and not a single practical adaptation method.
- The limited number of objects in the dataset (5 objects) may lead to non-generalizable conclusions.
Taxonomy
Topics: Robot Manipulation and Learning · Human Pose and Action Recognition · Robotics and Sensor-Based Localization
