TL;DR
This paper introduces a novel method for depth estimation from single-camera dual-pixel data, leveraging hardware and optics understanding to improve accuracy and reduce model complexity, demonstrated on a new dataset.
Contribution
It identifies an ambiguity in dual-pixel depth cues and develops a learning approach that effectively estimates depth up to this ambiguity, enabling better monocular depth estimation.
Findings
Model achieves 30% higher accuracy than prior methods.
Effective use of dual-pixel hardware for depth estimation.
Small models can produce high-quality depth maps.
| Loss | | | |
|---|---|---|---|
| Folded Loss | .0225 | .0318 | .195 |
| 3D Assisted Loss | .0175 | .0264 | .139 |

| Method | Invariance | Evaluated on Our Depth | | | Evaluated on COLMAP Depth | | | Geometric Mean |
|---|---|---|---|---|---|---|---|---|
| RGB Input | | | | | | | | |
| DPNet | None | .0602 | .0754 | .631 | .0607 | .0760 | .652 | .1432 |
| DPNet | Scale | .0409 | .0544 | .490 | .0419 | .0557 | .514 | .1047 |
| DPNet | Affine | .0398 | .0530 | .464 | .0410 | .0546 | .493 | .1014 |
| DORN [19] (NYUDv2 model) | | .0421 | .0555 | .407 | .0426 | .0557 | .419 | .0990 |
| DORN [19] (KITTI model) | | .0490 | .0631 | .549 | .0492 | .0630 | .558 | .1196 |
| RGB + DP Input | | | | | | | | |
| DPNet | None | .0581 | .0735 | .827 | .0587 | .0742 | .834 | .1530 |
| DPNet | Scale | .0202 | .0295 | .162 | .0213 | .0322 | .178 | .0477 |
| DPNet | Affine | .0175 | .0264 | .139 | .0190 | .0298 | .156 | .0422 |
| VGG | None | .0370 | .0492 | .350 | .0383 | .0513 | .360 | .0876 |
| VGG | Scale | .0224 | .0321 | .181 | .0242 | .0356 | .208 | .0535 |
| VGG | Affine | .0186 | .0275 | .149 | .0202 | .0308 | .166 | .0446 |
| Godard et al. [22] (ResNet50) | None† | .0562 | .0714 | .738 | .0568 | .0720 | .745 | .1442 |
| Godard et al. [22] (ResNet50) | Scale† | .0260 | .0367 | .227 | .0270 | .0383 | .239 | .0613 |
| Godard et al. [22] (ResNet50) | Affine† | .0251 | .0356 | .222 | .0257 | .0366 | .232 | .0592 |
| Garg et al. [20] (ResNet50) | None | .0571 | .0722 | .761 | .0577 | .0728 | .772 | .1472 |
| Garg et al. [20] (ResNet50) | Scale† | .0261 | .0369 | .228 | .0267 | .0382 | .237 | .0613 |
| Garg et al. [20] (ResNet50) | Affine† | .0248 | .0352 | .216 | .0255 | .0365 | .227 | .0584 |
| Wadhwa et al. [55] | | .0270 | .0375 | .236 | .0276 | .0388 | .245 | .0630 |

| Model | | | |
|---|---|---|---|
| DPNet trained on Stereo | .0218 | .0319 | .180 |
| DPNet trained on Multi-view | .0175 | .0264 | .139 |

| Test set | | | |
|---|---|---|---|
| Extended test set | .0188 | .0276 | .153 |
| Standard test set | .0175 | .0264 | .139 |

| Method | Invariance | Percentile Based WRMSE (Our Depth) | Percentile Based WRMSE (COLMAP Depth) |
|---|---|---|---|
| DPNet (RGB) | Scale | .0890 | .0908 |
| DPNet (RGB) | Affine | .1502 | .1484 |
| DPNet (RGB+DP) | Scale | .0390 | .0417 |
| DPNet (RGB+DP) | Affine | .0328 | .0368 |
Learning Single Camera Depth Estimation using Dual-Pixels
Rahul Garg Neal Wadhwa Sameer Ansari Jonathan T. Barron
Google Research
Abstract
Deep learning techniques have enabled rapid progress in monocular depth estimation, but their quality is limited by the ill-posed nature of the problem and the scarcity of high quality datasets. We estimate depth from a single camera by leveraging the dual-pixel auto-focus hardware that is increasingly common on modern camera sensors. Classic stereo algorithms and prior learning-based depth estimation techniques underperform when applied on this dual-pixel data, the former due to too-strong assumptions about RGB image matching, and the latter due to not leveraging the understanding of optics of dual-pixel image formation. To allow learning based methods to work well on dual-pixel imagery, we identify an inherent ambiguity in the depth estimated from dual-pixel cues, and develop an approach to estimate depth up to this ambiguity. Using our approach, existing monocular depth estimation techniques can be effectively applied to dual-pixel data, and much smaller models can be constructed that still infer high quality depth. To demonstrate this, we capture a large dataset of in-the-wild 5-viewpoint RGB images paired with corresponding dual-pixel data, and show how view supervision with this data can be used to learn depth up to the unknown ambiguity. On our new task, our model is more accurate than any prior work on learning-based monocular or stereoscopic depth estimation.
1 Introduction
Depth estimation has long been a central problem in computer vision, both as a basic component of visual perception, and in service to various graphics, recognition, and robotics tasks. Depth can be acquired via dedicated hardware that directly senses depth (time-of-flight, structured light, etc) but these sensors are often expensive, power-hungry, or limited to certain environments (such as indoors). Depth can be inferred from multiple cameras through the use of multi-view geometry, but building a stereo camera requires significant complexity in the form of calibration, rectification, and synchronization. Machine learning techniques can be used to estimate depth from a single image, but the under-constrained nature of image formation often results in inaccurate estimation.
Recent developments in consumer hardware may provide an opportunity for a new approach in depth estimation. Cameras have recently become available that allow a single camera to simultaneously capture two images that resemble a stereo pair with a tiny baseline (Fig. 1), through the use of dense dual-pixel (DP) sensors (Fig. 2). Though this technology was originally developed in service of camera auto-focus, dual-pixel images can also be exploited to recover dense depth maps from a single camera, thereby obviating any need for additional hardware, calibration, or synchronization. For example, Wadhwa et al. [55] used classical stereo techniques (block matching and edge aware smoothing) to recover depth from DP data. But as shown in Fig. 1, the quality of depth maps that can be produced by conventional stereo techniques is limited, because the interplay between disparity and focus in DP imagery can cause classic stereo-matching techniques to fail. Existing monocular learning-based techniques also perform poorly on this task. In this paper, we analyze the optics of image formation for dual-pixel imagery and demonstrate that DP images have a fundamentally ambiguous relationship with respect to scene depth — depth can only be recovered up to some unknown affine transformation. With this observation, we analytically derive training procedures and loss functions that incorporate prior knowledge of this ambiguity, and are therefore capable of learning effective models for affine-invariant depth estimation. We then use these tools to train deep neural networks that estimate high-quality depth maps from DP imagery, thereby producing detailed and accurate depth maps using just a single camera. Though the output of our learned model suffers from the same affine ambiguity that our training data does, the affine-transformed depths estimated by our model can be of great value in certain contexts, such as depth ordering or defocus rendering.
Training and evaluating our model requires large amounts of dual-pixel imagery that has been paired with ground-truth depth maps. Because no such dataset exists, in this work we also design a capture procedure for collecting “in the wild” dual-pixel imagery where each image is paired with multiple alternate views of the scene. These additional views allow us to train our model using view supervision, and allow us to use multi-view geometry to recover ground-truth estimates of the depth of the scene for use in evaluation. When comparing against the state-of-the-art in depth estimation, our proposed model produces error rates that are lower than previous dual-pixel and monocular depth estimation approaches.
2 Related Work
Historically, depth estimation has seen the most attention and progress in the context of stereo [49] or multi-view geometry [25], in which multiple views of a scene are used to partially constrain its depth, thereby reducing the inherent ambiguity of the problem. Estimating the depth of a scene from a single image is significantly more underconstrained, and though it has also been an active research area, progress has happened more slowly. Classic monocular depth approaches relied on singular cues, such as shading [31], texture [7], and contours [13] to inform depth, with some success in constrained scenarios. Later work attempted to use learning to explicitly consolidate these bottom-up cues into more robust monocular depth estimation techniques [10, 29, 48], but progress on this problem accelerated rapidly with the rise of deep learning models trained end-to-end for monocular depth estimation [17, 19], themselves enabled by the rise of affordable consumer depth sensors which allowed collection of large RGBD datasets [34, 43, 51]. The rise of deep learning also yielded progress in stereoscopic depth estimation [56] and in the related problem of motion estimation [16]. The need for RGBD data in training monocular depth estimation models was lessened by the discovery that the overconstraining nature of multi-view geometry could be used as a supervisory cue for training such systems [18, 20, 22, 37, 42, 57], thereby allowing “self-supervised” training using only video sequences or stereo pairs as input. Our work builds on these monocular and stereo depth prediction algorithms, as we construct a learning-based “stereo” technique, but using the impoverished dual-pixel data present within a single image.
An alternative strategy for constraining the geometry of the scene is to vary the camera’s focus. Using this “depth from (de)focus” [24] approach, depth can be estimated from focal stacks using classic vision techniques [53] or deep learning approaches [26]. Focus can be made more informative in depth estimation by manually “coding” the aperture of a camera [40], thereby causing the camera’s circle of confusion to more explicitly encode scene depth. Focus cues can also be used as supervision in training a monocular depth estimation model [52]. Reasoning about the relationship between depth and the apparent focus of an image is critical when considering dual-pixel cameras, as the effective point spread functions of the “left” and “right” views are different. By using a flexible learning framework, our model is able to leverage the focus cues present in dual-pixel imagery in addition to the complementary stereo cues.
Stereo cameras and focal stacks are ways of sampling what Adelson and Bergen called “the plenoptic function”: a complete record of the angle and position of all light passing through space [3]. An alternative way of sampling the plenoptic function is a light field [41], a 4D function that contains conventional images as 2D slices. Light fields can be used to directly synthesize images from different positions or with different aperture settings [44], and light field cameras can be made by placing a microlens array on the sensor of a conventional camera [4, 45]. Light fields provide a convenient framework for analyzing the equivalence of correspondence and focus cues [54]. While light fields have been used to recover depth [35, 36], constructing a light field camera requires sacrificing spatial resolution in favor of angular resolution, and as such light field cameras have not seen rapid consumer adoption. Dual-pixel cameras appear to represent a promising compromise between more ambitious light field cameras and conventional cameras: DP cameras sacrifice a negligible amount of spatial resolution to sample two angles in a light field, while true monocular cameras sample only a single angle, and light field cameras such as the Lytro Illum sample 196 angles at the cost of significant spatial resolution. As a result, DP cameras have seen wider adoption in consumer cameras and in space-constrained applications like endoscopy [6].
3 Dual-Pixel Geometry and Ambiguity
Dual-pixel (DP) sensors work by splitting each pixel in half, such that the left half integrates light over the right half aperture and vice versa (Fig. 3). Because each half of a dual-pixel integrates light over one half of the aperture, the two halves of a pixel together form a kind of stereo pair, in which nearby objects exhibit some horizontal disparity between the two views in accordance with their distance. This effect interacts with the optical blur induced by the lens of the camera, such that when image content is far from the focal plane, the effects of optical blur are spread across the two “views” of each dual-pixel (Fig. 3(a, DP data)). The sum of the two views accounts for all the light going through the aperture and is equal to the ordinary full-pixel image that would be captured by a non dual-pixel sensor. As a result, the disparity between the two views in a dual-pixel image is proportional to what the defocus blur size would be in an equivalent full-pixel image. Dual-pixel sensors are commonly used within consumer cameras to aid in auto-focus: the camera iteratively estimates disparity from the dual-pixels in some focus region and moves the lens until that disparity is zero, resulting in an image in which the focus region is in focus.
While dual-pixel imagery can be thought of as a stereo pair with a tiny baseline, it differs from stereo in several ways. The views are perfectly synchronized (both spatially and temporally) and have the same exposure and white balance. In addition, the two views in DP images have different point-spread functions that can encode additional depth information. Traditional stereo matching techniques applied to dual-pixel data will not only ignore the additional depth information provided by focus cues, but may even fail in out-of-focus regions due to the effective PSFs of the two views being so different that conventional image matching fails (Figs. 1(d)-1(f)). As an additional complication, the relationship between depth and disparity in dual-pixel views depends not only on the baseline between the two views, but also on the focus distance. Thus, unlike depth from stereo, which has only a scale ambiguity if the extrinsics are unknown, depth from dual-pixel data has both scale and offset ambiguities if the camera’s focus distance is unknown (as is the case for most current consumer cameras, such as those we use). Addressing the ambiguity caused by this unknown scale and offset is critical when learning to estimate depth from dual-pixel imagery, and is a core contribution of this work. As we will demonstrate, for a network to successfully learn from dual-pixel imagery, it will need to be made aware of this affine ambiguity.
We will now derive the relationship between depth, disparity, and blur size according to the paraxial and thin-lens approximations. Consider a scene consisting of point light sources located at coordinates $(x, y, Z(x, y))$ in camera space. As stated previously, the disparity $d(x, y)$ of one such point on the image plane is proportional to the (signed) blur size $\bar{b}(x, y)$, where the sign is determined by whether the light source is in front of or behind the focal plane. Therefore, from the paraxial and thin-lens approximations:
$$
\begin{aligned}
d(x, y) &= \alpha \, \bar{b}(x, y) && (1) \\
&\approx \alpha \, \frac{L f}{1 - f/g} \left( \frac{1}{g} - \frac{1}{Z(x, y)} \right) && (2) \\
&\triangleq A(L, f, g) + \frac{B(L, f, g)}{Z(x, y)} && (3)
\end{aligned}
$$
where $\alpha$ is a constant of proportionality, $L$ is the diameter of the aperture, $f$ is the focal length of the lens and $g$ is the focus distance of the camera. We make the affine relationship between inverse depth and disparity explicit in Eqn. 3 by defining image-wide constants $A(L, f, g)$ and $B(L, f, g)$. This equation reflects our previous assertion that perfect knowledge of disparity $d$ and blur size $\bar{b}$ only gives enough information to recover depth $Z$ if the parameters $L$, $f$ and $g$ are known. Please see the supplement for a derivation.
Eqn. 3 demonstrates the aforementioned affine ambiguity in dual-pixel data. This means that different sets of camera parameters and scene geometries can result in identical dual-pixel images (Fig. 3(b)). Specifically, two sets of camera parameters can result in two sets of affine coefficients $(A_1, B_1)$ and $(A_2, B_2)$ such that the same image-plane disparity is produced by two different scene depths $Z_1 \neq Z_2$:
$$
d(x, y) = A_1 + \frac{B_1}{Z_1(x, y)} = A_2 + \frac{B_2}{Z_2(x, y)} \qquad (4)
$$
Consumer smartphone cameras are not reliable in recording camera intrinsic metadata [15], thereby eliminating the easiest way that this ambiguity could be resolved. But Eqn. 3 does imply that it is possible to use DP data to estimate some (unknown) affine transform of inverse depth. This motivates our technique of training a CNN to estimate inverse depth only up to an affine transformation.
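The ambiguity can be checked numerically. The following is a minimal sketch, assuming the thin-lens form in which signed disparity is an offset plus a scale times inverse depth. The function name `dp_disparity`, the symbol names, and all parameter values are illustrative assumptions and are not taken from the paper's data.

```python
import numpy as np

def dp_disparity(Z, L, f, g, alpha=1.0):
    """Signed disparity under a thin-lens model, written as d = A + B / Z.

    Z: scene depth, L: aperture diameter, f: focal length,
    g: focus distance, alpha: proportionality constant (all hypothetical values).
    """
    scale = alpha * L * f / (1.0 - f / g)
    A = scale / g   # image-wide offset
    B = -scale      # image-wide factor multiplying inverse depth 1 / Z
    return A + B / Z, A, B

# Camera 1 observes a scene at depths Z1 (meters).
Z1 = np.array([0.5, 1.0, 1.5, 2.0])
d1, A1, B1 = dp_disparity(Z1, L=0.002, f=0.004, g=1.0)

# Camera 2 has a different aperture and focus distance, hence different (A2, B2).
_, A2, B2 = dp_disparity(Z1, L=0.003, f=0.004, g=2.0)

# Solve d1 = A2 + B2 / Z2 for the depths camera 2 would need to observe.
Z2 = B2 / (d1 - A2)
d2 = A2 + B2 / Z2   # identical disparities, different scene depths
```

Here `d1` and `d2` agree even though `Z1` and `Z2` differ, and `1 / Z2` is an affine function of `1 / Z1`, which is why only an affine transform of inverse depth can be recovered from disparity alone.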
Though absolute depth would certainly be preferred over an affine-transformed depth, the affine-transformed depth that can be recovered from dual-pixel imagery is of significant practical use. Because affine transformations are monotonic, an affine-transformed depth still allows for reasoning about relative ordering of scene depths. Affine-invariant depth is a natural fit for synthetic defocus (simulating wide aperture images by applying a depth dependent blur to a narrow aperture image [9, 55]) as the affine parameters naturally map to the user controls — the depth to focus at, and the size of the aperture to simulate. Additionally, this affine ambiguity can be resolved using heuristics such as the likely sizes of known objects [30], thereby enabling the many uses of metric depth maps.
4 View supervision for Affine Invariant Depth
A common approach for training monocular depth estimation networks from multi-view data is to use self supervision. This is typically performed by warping an image from one viewpoint to the other according to the estimated depth and then using the difference between the warped image and the actual image as some loss to be minimized. Warping is implemented using a differentiable spatial transformer layer [33] that allows end-to-end training using only RGB views and camera poses. Such a loss can be expressed as:
$$
\mathcal{L}(I_0, \Theta) = \sum_{(x, y)} \Delta\big( I_0(x, y),\; I_1(M(x, y; F(I_0, \Theta))) \big) \qquad (5)
$$
where $I_0$ is the RGB image of interest, $I_1$ is a corresponding stereo image, $F(I_0, \Theta)$ is the (inverse) depth estimated by a network with parameters $\Theta$ for $I_0$, $M(x, y; D)$ is the warp induced on pixel coordinates $(x, y)$ by that estimated depth $D = F(I_0, \Theta)$ and by the known camera poses, and $\Delta(\cdot, \cdot)$ is some arbitrary function that scores the per-pixel difference between two RGB values. $\Delta$ will be defined in Sec. 6.2, but for our current purposes it can be any differentiable penalty. Because we seek to predict inverse depth up to an unknown affine transform, the loss in Eqn. 5 cannot be directly applied to our case. Hence, we introduce two different methods of training with view supervision while predicting inverse depth up to an affine ambiguity.
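A warp-and-compare loss of this kind can be sketched for the simplest case. This is an illustration under stated assumptions, not the paper's implementation: the two views are assumed to be a rectified horizontal pair, a pixel at column `x` in the reference view is assumed to appear at column `x - disparity` in the second view, disparity is assumed to equal a known constant times inverse depth, and a plain L1 difference stands in for the per-pixel penalty. The names `warp_horizontal`, `view_supervision_loss` and `baseline_times_focal` are hypothetical, and NumPy is used for readability, so this version is not differentiable.

```python
import numpy as np

def warp_horizontal(I1, disparity):
    """Resample I1 (H, W, 3) at columns x - disparity, linearly interpolated."""
    H, W = disparity.shape
    xs = np.clip(np.arange(W)[None, :] - disparity, 0, W - 1)  # source columns
    x0 = np.floor(xs).astype(int)
    x1 = np.minimum(x0 + 1, W - 1)
    w = (xs - x0)[..., None]            # weight of the right-hand neighbor
    rows = np.arange(H)[:, None]
    return (1.0 - w) * I1[rows, x0] + w * I1[rows, x1]

def view_supervision_loss(I0, I1, inv_depth, baseline_times_focal):
    """Mean L1 difference between I0 and I1 warped into I0's viewpoint."""
    disparity = baseline_times_focal * inv_depth
    warped = warp_horizontal(I1, disparity)
    return np.abs(I0 - warped).mean()
```

A correct inverse depth map aligns the warped second view with the reference image and gives a low loss, while an incorrect one leaves a residual. Out-of-range source columns are clamped to the image border, so a few border pixels still contribute error even for the correct depth.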
4.1 3D Assisted Loss
If we assume that we have access to a ground truth inverse depth $D^*$ and corresponding per-pixel confidences $C$ for that depth, we can find the unknown affine mapping between the network's prediction $F(I_0, \Theta)$ and $D^*$ by solving
$$
\arg\min_{a, b} \sum_{(x, y)} C(x, y) \big( D^*(x, y) - (a \, F(I_0, \Theta)(x, y) + b) \big)^2 \qquad (6)
$$
While training our model with parameters $\Theta$, during each evaluation of our loss we solve Eqn. 6 using a differentiable least squares solver (such as the one included in TensorFlow) to obtain $a$ and $b$, which can be used to obtain the absolute depth $a \, F(I_0, \Theta) + b$ that can then be used to compute a standard view supervision loss. Note that since we only need to solve for two scalars, a sparse ground truth depth map with a few confident depth samples suffices.
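The confidence-weighted fit for the two affine scalars is an ordinary weighted least-squares problem. The sketch below uses NumPy's `np.linalg.lstsq` purely for illustration; it is not differentiable, whereas training as described above relies on a differentiable solver. The function name `fit_affine` and its argument names are assumptions.

```python
import numpy as np

def fit_affine(pred, target, conf):
    """Return (a, b) minimizing sum(conf * (target - (a * pred + b)) ** 2)."""
    x = pred.ravel()
    y = target.ravel()
    sw = np.sqrt(conf.ravel())   # scaling rows by sqrt(weight) applies the weights
    X = np.stack([x, np.ones_like(x)], axis=1) * sw[:, None]
    (a, b), *_ = np.linalg.lstsq(X, y * sw, rcond=None)
    return a, b
```

Pixels with zero confidence contribute nothing to the fit, so a handful of confident samples with distinct predicted values is enough to determine both scalars.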
4.2 Folded Loss
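Our second strategy does not require ground truth depth and folds the optimization required to solve for the affine parameters into the overall loss function. We associate variables $a$ and $b$ with each training example $I_0$ and define our loss function as:
$$
\mathcal{L}_f(I_0, \Theta, a, b) = \sum_{(x, y)} \Delta\big( I_0(x, y),\; I_1(M(x, y; a \, F(I_0, \Theta) + b)) \big) \qquad (7)
$$
where $\Delta$, $I_1$, $M$ and $F$ are the penalty, second view, warp and network of Eqn. 5, and then let gradient descent optimize for $\Theta$, $\{a^{(i)}\}$ and $\{b^{(i)}\}$ by solving
$$
\arg\min_{\Theta, \{a^{(i)}\}, \{b^{(i)}\}} \sum_i \mathcal{L}_f\big(I_0^{(i)}, \Theta, a^{(i)}, b^{(i)}\big) \qquad (8)
$$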
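To avoid degeneracies as the per-example scale $a$ approaches zero, we optimize a reparameterized version of $a$ rather than $a$ itself, and we initialize the per-example scale and offset variables from a uniform distribution. To train this model, we simply construct one optimizer instance in which the network parameters $\Theta$ and the per-example variables $\{a^{(i)}\}$ and $\{b^{(i)}\}$ are all treated as free variables and optimized over jointly.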
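The idea of carrying one scale and one offset per training example and letting gradient descent fit them can be illustrated on a toy problem. This is a simplified sketch under loud assumptions: the network output is held fixed (so only the per-example variables are optimized), a squared error against a hidden affine target stands in for the photometric loss, the scale is kept positive with a softplus plus a small constant, and the initialization range is chosen arbitrarily. None of these specifics, nor the names `fit_folded`, `EPS` and `a_raw`, come from the text.

```python
import numpy as np

EPS = 1e-5  # assumed small floor that keeps the scale away from zero

def softplus(x):
    return np.logaddexp(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_folded(preds, targets, steps=10000, lr=0.2, seed=0):
    """Fit one (a, b) per example by gradient descent; preds, targets are (n, P)."""
    rng = np.random.default_rng(seed)
    n = preds.shape[0]
    a_raw = rng.uniform(-1.0, 1.0, size=n)   # unconstrained scale variable (assumed range)
    b = rng.uniform(-1.0, 1.0, size=n)
    for _ in range(steps):
        a = EPS + softplus(a_raw)            # positive scale
        resid = a[:, None] * preds + b[:, None] - targets
        grad_a = 2.0 * (resid * preds).mean(axis=1)
        grad_b = 2.0 * resid.mean(axis=1)
        a_raw -= lr * grad_a * sigmoid(a_raw)  # chain rule through softplus
        b -= lr * grad_b
    return EPS + softplus(a_raw), b
```

In a full training setup the network parameters would be updated in the same optimizer step as these per-example variables. Here they are omitted so that the role of the reparameterized scale is easy to see.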
5 Data Collection
To train and evaluate our technique, we need dual-pixel data paired with ground-truth depth information. We therefore collected a large dataset of dual-pixel images captured in a custom-made capture rig in which each dual-pixel capture is accompanied by 4 simultaneous images with a moderate baseline, arranged around the central camera (Figure 4(a)). We compute “ground truth” depths by applying established multi-view geometry techniques to these 5 images. These depths are often incomplete compared to those produced by direct depth sensors, such as the Kinect or LIDAR. However, such sensors can only image certain kinds of scenes — the Kinect only works well indoors, and it is difficult to acquire LIDAR scans of scenes that resemble normal consumer photography. Synchronization and registration of these sensors with the dual-pixel images is also cumbersome. Additionally, the spatial resolutions of direct depth sensors are far lower than the resolutions of RGB cameras. Our approach allows us to capture a wide variety of high-resolution images, captured both indoors and outdoors, that resemble what people typically capture with their cameras: pets, flowers, etc (we do not include images of faces in our dataset, due to privacy concerns). The plus-shaped arrangement means that it is unlikely that a pixel in the center camera is not visible in at least one other camera (barring small apertures or very nearby objects) thereby allowing us to recover accurate depths even in partially-occluded regions. The cameras are synchronized using the system of [5], thereby allowing us to take photos from all phones at the same time (within milliseconds, or half a frame) which allows us to reliably image moving subjects. 
Though the inherent difficulty of the aperture problem means that our ground-truth depths are rarely perfect, we are able to reliably recover high-precision partial depth maps, in which high-confidence locations have accurate depths and inaccurate depths are flagged as low-confidence (Figures 4(b), 4(c)). To ensure that our results are reliable and not a function of some particular stereo algorithm, we compute two separate depth maps (each with an associated confidence) using two different algorithms: the established COLMAP stereo technique [50, 1], and a technique we designed for this task. See the supplement for a detailed description.
Our data is collected using a mix of two widely available consumer phones with dual-pixels: The Google Pixel 2 and the Google Pixel 3. For each capture, all 5 images are collected using the same model of phone. We captured scenes resulting in RGB and DP images. Our photographer captured a wide variety of images that reflect the kinds of photos people take with their camera, with a bias towards scenes that contain interesting nearby depth variation, such as a subject that is 0.5 - 2 meters away. Though all images contain RGB and DP information, for this work we only use the DP signal of the center camera. All other images are treated as conventional RGB images. We process RGB and DP images at a resolution of , but compute “ground truth” depth maps at half this resolution to reduce noise. We use inverse perspective sampling in the range 0.2 - 100 meters to convert absolute depth to inverse depth . Please see the supplement for more details.
Though our capture rig means that the relative positions of our 5 cameras are largely fixed, and our synchronization means that our sets of images are well-aligned temporally, we were unable to produce a single fixed intrinsic and extrinsic calibration of our camera rig that worked well across all sets of images. This is likely due to the lens not being fixed in place in the commodity smartphone cameras we use. As a result, the focus may drift due to mechanical strain or temperature variation, the lens may jitter off-axis while focusing, and optical image stabilization may move the camera’s center of projection [15]. For this reason, we use structure from motion [25] with priors provided by the rig design to solve for the extrinsics and intrinsics of the 5 cameras individually for each capture, which results in an accurate calibration for all captures. This approach introduces a variable scale ambiguity in the reconstructed depth for each capture, but this is not problematic for us as our training and evaluation procedures assume an unknown scale ambiguity.
6 Experiments
We describe our data, evaluation metrics and method of training our CNN for depth prediction. In addition, we compare using affine-invariant losses to using scale-invariant and ordinary losses and demonstrate that affine-invariant losses improve baseline methods for predicting depth from dual-pixel images.
6.1 Data Setup
References
- [1] COLMAP. https://colmap.github.io/.
- [2] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015.
- [3] Edward H Adelson and James R Bergen. The plenoptic function and the elements of early vision. Computational Models of Visual Processing, 1991.
- [4] Edward H Adelson and John YA Wang. Single lens stereo with a plenoptic camera. TPAMI, 1992.
- [5] Sameer Ansari, Neal Wadhwa, Rahul Garg, and Jiawen Chen. Wireless software synchronization of multiple distributed cameras. ICCP, 2019.
- [6] Sam Y. Bae, Ronald J. Korniski, Michael Shearn, Harish M. Manohara, and Hrayr Shahinian. 4-mm-diameter three-dimensional imaging endoscope with steerable camera for minimally invasive surgery (3-D-MARVEL). Neurophotonics, 2016.
- [7] Ruzena Bajcsy and Lawrence Lieberman. Texture gradient as a depth cue. CGIP, 1976.
- [8] Jonathan T. Barron. A general and adaptive robust loss function. CVPR, 2019.
