Cameras as Rays: Pose Estimation via Ray Diffusion
Jason Y. Zhang, Amy Lin, Moneish Kumar, Tzu-Hsuan Yang, Deva Ramanan, Shubham Tulsiani

TL;DR
This paper introduces a novel distributed ray-based representation for camera pose estimation from sparse views, leveraging transformers and diffusion models to improve accuracy and uncertainty modeling, achieving state-of-the-art results.
Contribution
It proposes a new ray-based distributed representation for camera pose estimation, combined with transformer and diffusion models, advancing accuracy and generalization in sparse-view scenarios.
Findings
State-of-the-art performance on the CO3D dataset
Effective generalization to unseen categories
Improved pose precision with ray-based representation
Abstract
Estimating camera poses is a fundamental task for 3D reconstruction and remains challenging given sparsely sampled views (<10). In contrast to existing approaches that pursue top-down prediction of global parametrizations of camera extrinsics, we propose a distributed representation of camera pose that treats a camera as a bundle of rays. This representation allows for a tight coupling with spatial image features improving pose precision. We observe that this representation is naturally suited for set-level transformers and develop a regression-based approach that maps image patches to corresponding rays. To capture the inherent uncertainties in sparse-view pose inference, we adapt this approach to learn a denoising diffusion model which allows us to sample plausible modes while improving performance. Our proposed methods, both regression- and diffusion-based, demonstrate…
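The "bundle of rays" idea in the abstract can be made concrete with a small sketch. It assumes a pinhole camera with a world-to-camera convention (x_cam = R x_world + t), one ray per patch center, and a 6-D Plücker parameterization (unit direction d and moment m = c × d, with c the camera center). None of these details is spelled out in the summary above, and the function names `camera_to_rays` and `rays_to_center` are hypothetical. This illustrates the representation only, not the authors' network.

```python
import numpy as np

def skew(v):
    """Cross-product matrix: skew(v) @ x == np.cross(v, x)."""
    x, y, z = v
    return np.array([[0.0, -z, y], [z, 0.0, -x], [-y, x, 0.0]])

def camera_to_rays(K, R, t, image_size, grid=4):
    """Return (grid*grid, 6) Plücker rays <d, m>, one per patch center.

    Assumed convention: x_cam = R @ x_world + t (world-to-camera).
    """
    w, h = image_size
    # Pixel coordinates of the patch centers on a grid x grid layout.
    us = (np.arange(grid) + 0.5) * w / grid
    vs = (np.arange(grid) + 0.5) * h / grid
    uu, vv = np.meshgrid(us, vs)
    pix = np.stack([uu.ravel(), vv.ravel(), np.ones(grid * grid)], axis=1)
    # Unproject pixels to camera-frame directions, then rotate into the world frame.
    dirs_cam = pix @ np.linalg.inv(K).T
    dirs = dirs_cam @ R  # row-vector form of R.T @ d_cam
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
    center = -R.T @ t  # camera center in world coordinates
    moments = np.cross(center, dirs)  # m = c x d
    return np.concatenate([dirs, moments], axis=1)

def rays_to_center(rays):
    """Least-squares camera center from Plücker rays, using m = c x d = -skew(d) @ c."""
    d, m = rays[:, :3], rays[:, 3:]
    A = np.concatenate([-skew(di) for di in d], axis=0)
    c, *_ = np.linalg.lstsq(A, m.reshape(-1), rcond=None)
    return c

# Illustrative camera; all values below are made up for the example.
K = np.array([[100.0, 0.0, 64.0], [0.0, 100.0, 64.0], [0.0, 0.0, 1.0]])
th = np.deg2rad(30.0)
R = np.array([[np.cos(th), 0.0, np.sin(th)],
              [0.0, 1.0, 0.0],
              [-np.sin(th), 0.0, np.cos(th)]])
t = np.array([0.1, -0.2, 2.0])
rays = camera_to_rays(K, R, t, (128, 128), grid=4)
print(rays.shape)  # (16, 6)
```

Because every ray passes through the same camera center, the center can be recovered from the bundle by least squares, as `rays_to_center` does. Under these assumptions, that is the sense in which the ray bundle is an over-parameterized but complete description of the camera.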
Peer Reviews
Decision: ICLR 2024 oral
- I think this is a good method of formulating camera pose and intrinsic recovery using a bundle of rays. Furthermore, the authors' observation that the ray-based representation is well suited for set-level transformers is well backed by the results. - The authors' "regression" based method outperforms other "diffusion" based methods, which shows that over-parameterization is really helping solve for camera geometry accurately. - The results outperform currently available "learning" based and "cor
- One dataset is too small to show the applicability of a method. Since I see this method as superior to "PoseDiffusion", it would be great to see some results on the "scene-centric" dataset and compare them against PoseDiffusion. - It would be nice to see the "memory" requirement to run these models. Processing N image features together, I am assuming, requires a good amount of GPU memory. - It would also be nice to see accuracy at different thresholds, i.e. @5, @10, @15. - It would also be nice to se
The strengths of this paper are the novelty of the approach and the quality of results, which together are likely to have a significant impact in the field of wide baseline camera estimation. Directly regressing rays intuitively makes sense as they are more suited to regression by a neural network, since each ray depends on more local image information. The paper makes this point clear and backs it up with experimental results. Overall, the paper is written clearly and is easy to understand.
The main weakness of this paper is the somewhat contrived and limited dataset and metrics used in the experimental results. The CO3D dataset consists of many turntable-like videos with a camera orbiting in a circle around a single object of interest at an approximately fixed distance. The variability of camera poses is quite limited compared to images in the wild. Furthermore, the image is tightly cropped around the object of interest. This tight cropping ensures that most rays sampled pass t
1. The authors propose a novel representation of pose that uses a bundle of rays to denote a camera in sparse-view pose estimation. 2. To infer the rays, the authors develop a deterministic regression network and a probabilistic diffusion model, and the experiments on CO3D demonstrate superior performance.
1. The authors state in the introduction that the traditional representation of pose may be suboptimal for neural learning. However, no further discussion is given. A more specific explanation is necessary, and a comparison with the proposed novel representation of pose is also required. 2. Punctuation is necessary at the end of each equation; please check it carefully. 3. The authors fail to state more details of the proposed network architecture. Moreover, the training detail is al
Taxonomy
Topics: Advanced Vision and Imaging · Satellite Image Processing and Photogrammetry · Computer Graphics and Visualization Techniques
Methods: Diffusion
