Monocular Normal Estimation via Shading Sequence Estimation
Zongrui Li, Xinhua Ma, Minghui Hu, Yunqing Zhao, Yingchen Yu, Qian Zheng, Chang Liu, Xudong Jiang, Song Bai

TL;DR
This paper introduces RoSE, a novel approach for monocular normal estimation that predicts shading sequences instead of normal maps directly, leading to better geometric alignment and state-of-the-art results.
Contribution
The paper proposes a new paradigm of shading sequence estimation for monocular normal estimation and leverages image-to-video generative models to improve accuracy.
Findings
RoSE achieves state-of-the-art performance on real-world benchmarks such as DiLiGenT and LUCES.
Shading sequence estimation improves geometric alignment over direct normal prediction.
Training on diverse synthetic data enhances robustness and generalization.
Abstract
Monocular normal estimation aims to estimate the normal map from a single RGB image of an object under arbitrary lights. Existing methods rely on deep models to directly predict normal maps. However, they often suffer from 3D misalignment: while the estimated normal maps may appear to have a correct appearance, the reconstructed surfaces often fail to align with the geometric details. We argue that this misalignment stems from the current paradigm: the model struggles to distinguish and reconstruct varying geometry represented in normal maps, as the differences in underlying geometry are reflected only through relatively subtle color variations. To address this issue, we propose a new paradigm that reformulates normal estimation as shading sequence estimation, where shading sequences are more sensitive to various geometric information. Building on this paradigm, we present RoSE, a…
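To make the link between a shading sequence and a normal map concrete, the sketch below recovers per-pixel normals from a stack of shading images by ordinary least squares. It is an illustration under stated assumptions, not the authors' implementation: the abstract above is truncated before it describes how RoSE is built, so the Lambertian model s = max(0, n · l), the known ring of light directions, the absence of shadows, and the helper names `ring_lights`, `render_shading`, and `normals_from_shading` are all hypothetical choices made here.

```python
import numpy as np


def ring_lights(num_lights=8, elevation_deg=45.0):
    """Unit light directions evenly spaced on a ring around the viewing (z) axis.

    Assumption: the light layout is known in advance; this ring is illustrative.
    """
    az = np.linspace(0.0, 2.0 * np.pi, num_lights, endpoint=False)
    el = np.deg2rad(elevation_deg)
    return np.stack(
        [np.cos(el) * np.cos(az), np.cos(el) * np.sin(az), np.full_like(az, np.sin(el))],
        axis=1,
    )  # shape (K, 3)


def render_shading(normals, lights):
    """Lambertian shading max(0, n . l) per light.

    normals: (H, W, 3) unit normals, lights: (K, 3) unit directions.
    Returns a shading sequence of shape (K, H, W).
    """
    return np.clip(np.einsum("kc,hwc->khw", lights, normals), 0.0, None)


def normals_from_shading(shading, lights, eps=1e-8):
    """Least-squares normals from a shading sequence with known lights.

    Solves lights @ n = s for every pixel at once. Assumption: no pixel is in
    shadow, so the clipped Lambertian model is exactly linear in n.
    """
    k, h, w = shading.shape
    s = shading.reshape(k, -1)                      # (K, H*W)
    n, *_ = np.linalg.lstsq(lights, s, rcond=None)  # (3, H*W)
    n = n.T.reshape(h, w, 3)
    norm = np.linalg.norm(n, axis=-1, keepdims=True)
    return n / np.maximum(norm, eps)                # renormalize to unit length
```

Calling `normals_from_shading(render_shading(gt, lights), lights)` on a patch of camera-facing unit normals `gt` returns the same normals up to numerical precision. In a setting like the one the abstract describes, the shading sequence would be predicted by a model rather than rendered, so the quality of the recovered normals depends on how well that sequence is estimated.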
Peer Reviews
Decision: ICLR 2026 Oral
- I appreciate the motivation and core observation of this paper, as well as how the method design naturally follows from the motivation. Specifically, I like the logical flow: identifying an underlying mechanism that may cause problems, analyzing it, and addressing it through targeted design choices. The approach feels somewhat “old school,” but I personally find it appealing.
- The experimental results are good and convincing.
- The figures in the paper are of low resolution. Many images appear blurry and show visible JPEG compression artifacts when zoomed in, making it difficult to discern differences compared to the baseline methods.
- As I mentioned in the paper summary, I think the authors could include more visualization results to help readers better appreciate the framework. For example, in the prediction pipeline, the authors use a video diffusion model to predict sequences of shading images. Could the author
1. Reformulating the task as shading sequence estimation effectively addresses the core issues of 3D misalignment and over-smoothing. The formulation is novel.
2. The new synthetic dataset includes diverse materials, lighting conditions, and material augmentation, which improves the model's generalization ability. The dataset is itself a contribution.
3. The method achieves superior performance on key real-world benchmark datasets such as DiLiGenT and LUCES.
1. The use of an image-to-video diffusion model for sequence generation introduces significant computational overhead, which may limit its use in real-time or resource-constrained applications.
2. The current evaluation is restricted to object-centric normal estimation, and generalizing the approach to complex scene-centric (indoor/outdoor) settings remains a key direction for future work.
Originality. The paper introduces a novel approach to monocular normal estimation by rethinking the problem in the context of shading sequence estimation. This conceptual shift is quite original, as traditional methods in normal estimation usually focus on individual lighting models or directly predicting normals from images. Quality. The paper delivers high-quality research with state-of-the-art results on multiple real-world benchmark datasets. Achieving superior performance compared to exis
1. Insufficient training details. While the paper states that RoSE is built upon the SV3D model (Voleti et al., 2024), the specific parameter configurations, training protocols, and dataset utilization are not described. Providing these details would significantly improve reproducibility.
2. Limited ablation studies. The current ablation experiments primarily focus on dataset influence. It would be valuable to include analyses of the model components, such as varying SV3D settings or substitutin
Taxonomy
Topics: Computer Graphics and Visualization Techniques · 3D Shape Modeling and Analysis · Generative Adversarial Networks and Image Synthesis
