Cameras as Rays: Pose Estimation via Ray Diffusion

Jason Y. Zhang; Amy Lin; Moneish Kumar; Tzu-Hsuan Yang; Deva Ramanan; Shubham Tulsiani

arXiv:2402.14817·cs.CV·April 5, 2024·3 cites

TL;DR

This paper introduces a distributed, ray-based representation for camera pose estimation from sparse views. It uses transformers and diffusion models to improve accuracy and model uncertainty, achieving state-of-the-art results on CO3D.

Contribution

It proposes a new ray-based distributed representation for camera pose estimation, combined with transformer and diffusion models, advancing accuracy and generalization in sparse-view scenarios.

Findings

01

State-of-the-art performance on CO3D dataset

02

Effective generalization to unseen categories

03

Improved pose precision with ray-based representation

Abstract

Estimating camera poses is a fundamental task for 3D reconstruction and remains challenging given sparsely sampled views (<10). In contrast to existing approaches that pursue top-down prediction of global parametrizations of camera extrinsics, we propose a distributed representation of camera pose that treats a camera as a bundle of rays. This representation allows for a tight coupling with spatial image features improving pose precision. We observe that this representation is naturally suited for set-level transformers and develop a regression-based approach that maps image patches to corresponding rays. To capture the inherent uncertainties in sparse-view pose inference, we adapt this approach to learn a denoising diffusion model which allows us to sample plausible modes while improving performance. Our proposed methods, both regression- and diffusion-based, demonstrate…
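To make the "camera as a bundle of rays" idea concrete, the sketch below converts a pinhole camera into one ray per image patch center. Several choices here are assumptions for illustration and are not stated in the abstract above: the world-to-camera convention `x_cam = R @ x_world + t`, the pinhole intrinsics `fx, fy, cx, cy`, the (direction, moment) Plücker-style parametrization of each ray, and the helper names `camera_to_rays` and `patch_centers`. It is not the authors' implementation.

```python
import math

def transpose_mul(R, v):
    """Compute R^T v for a 3x3 matrix R given as nested lists."""
    return [sum(R[r][c] * v[r] for r in range(3)) for c in range(3)]

def cross(a, b):
    """Cross product of two 3-vectors."""
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def camera_to_rays(fx, fy, cx, cy, R, t, pixels):
    """Return one (direction, moment) pair per pixel.

    Assumed convention: x_cam = R @ x_world + t, so the camera
    center in world coordinates is c = -R^T t.
    """
    center = [-x for x in transpose_mul(R, t)]
    rays = []
    for u, v in pixels:
        # Back-project the pixel with the inverse pinhole intrinsics.
        d_cam = [(u - cx) / fx, (v - cy) / fy, 1.0]
        # Rotate the direction into world coordinates and normalize it.
        d = transpose_mul(R, d_cam)
        norm = math.sqrt(sum(x * x for x in d))
        d = [x / norm for x in d]
        # The moment c x d is the same for any point on the ray.
        m = cross(center, d)
        rays.append((d, m))
    return rays

def patch_centers(width, height, patches_per_side):
    """Pixel coordinates of the centers of a square grid of patches."""
    step_x = width / patches_per_side
    step_y = height / patches_per_side
    return [((i + 0.5) * step_x, (j + 0.5) * step_y)
            for j in range(patches_per_side)
            for i in range(patches_per_side)]

# Example: a 2x2 patch grid on a 100x100 image, identity rotation.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
bundle = camera_to_rays(100.0, 100.0, 50.0, 50.0,
                        identity, [1.0, 0.0, 0.0],
                        patch_centers(100, 100, 2))
```

Each patch gets its own ray, so a predictor can be tied to local image features instead of a single global rotation and translation. Every ray in the bundle passes through the same camera center, which is what lets a conventional camera be recovered from the rays afterwards.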

Peer Reviews

Decision·ICLR 2024 oral

Reviewer 01 · Rating 8 · accept, good paper · Confidence 4

Strengths

- I think this is a good method of formulating camera pose and intrinsic recovery using a bundle of rays. Furthermore, the authors' observation that a ray-based representation is well suited for set-level transformers is well backed by the results.
- The authors' "regression"-based method outperforms other "diffusion"-based methods, which shows that over-parameterization is really helping solve for camera geometry accurately.
- The results outperform currently available "learning"-based and "cor…

Weaknesses

- One dataset is too small to see the applicability of a method. Since I see this method as superior to "PoseDiffusion", it would be great to see some results on the "scene-centric" dataset and compare it against PoseDiffusion.
- It would be nice to see the "memory" requirement to run these models. Processing N image features together, I am assuming, requires a good amount of GPU memory.
- It would also be nice to see accuracy at different thresholds, i.e. @5, @10, @15.
- It would also be nice to se…

Reviewer 02 · Rating 8 · accept, good paper · Confidence 4

Strengths

The strengths of this paper are the novelty of the approach and the quality of results, which together are likely to have a significant impact in the field of wide-baseline camera estimation. Directly regressing rays intuitively makes sense, as they are more suited to regression by a neural network, since each ray depends on more local image information. The paper makes this point clear and backs it up with experimental results. Overall, the paper is written clearly and is easy to understand.

Weaknesses

The main weakness of this paper is the somewhat contrived and limited dataset and metrics used in the experimental results. The CO3D dataset consists of many turntable-like videos with a camera orbiting in a circle around a single object of interest at an approximately fixed distance. The variability of camera poses is quite limited compared to images in the wild. Furthermore, the image is tightly cropped around the object of interest. This tight cropping ensures that most rays sampled pass t…

Reviewer 03 · Rating 5 · marginally below the acceptance threshold · Confidence 3

Strengths

1. The authors propose a novel representation of pose that uses a bundle of rays to denote a camera in the field of sparse-view pose estimation.
2. To infer the rays, the authors develop a deterministic regression network and a probabilistic diffusion model, and the experiments on CO3D demonstrate superior performance.

Weaknesses

1. The authors state in the introduction that the traditional representation of pose may be suboptimal for neural learning. However, no further discussion is given. A more specific explanation is necessary, and a comparison with the proposed novel representation of pose is also required.
2. Punctuation is necessary at the end of each equation; please check it carefully.
3. The authors fail to state more details of the proposed network architecture. Moreover, the training detail is al…

Code & Models

Repositories

jasonyzhang/raydiffusion
pytorch

Videos

Cameras as Rays: Pose Estimation via Ray Diffusion· slideslive

Taxonomy

Topics: Advanced Vision and Imaging · Satellite Image Processing and Photogrammetry · Computer Graphics and Visualization Techniques

Methods: Diffusion