# Pix2Vox: Context-aware 3D Reconstruction from Single and Multi-view   Images

**Authors:** Haozhe Xie, Hongxun Yao, Xiaoshuai Sun, Shangchen Zhou, Shengping, Zhang

arXiv: 1901.11153 · 2020-06-02

## TL;DR

Pix2Vox is a novel deep learning framework that improves 3D object reconstruction from images by using a context-aware fusion module, leading to better accuracy, consistency, and speed compared to previous methods.

## Contribution

The paper introduces Pix2Vox, a new framework with a context-aware fusion module that enhances 3D reconstruction quality and speed, overcoming RNN limitations.

## Key findings

- Outperforms state-of-the-art methods on ShapeNet and Pix3D benchmarks.
- Achieves 24 times faster inference than 3D-R2N2.
- Demonstrates strong generalization to unseen categories.

## Abstract

Recovering the 3D representation of an object from single-view or multi-view RGB images by deep neural networks has attracted increasing attention in the past few years. Several mainstream works (e.g., 3D-R2N2) use recurrent neural networks (RNNs) to fuse multiple feature maps extracted from input images sequentially. However, when given the same set of input images with different orders, RNN-based approaches are unable to produce consistent reconstruction results. Moreover, due to long-term memory loss, RNNs cannot fully exploit input images to refine reconstruction results. To solve these problems, we propose a novel framework for single-view and multi-view 3D reconstruction, named Pix2Vox. By using a well-designed encoder-decoder, it generates a coarse 3D volume from each input image. Then, a context-aware fusion module is introduced to adaptively select high-quality reconstructions for each part (e.g., table legs) from different coarse 3D volumes to obtain a fused 3D volume. Finally, a refiner further refines the fused 3D volume to generate the final output. Experimental results on the ShapeNet and Pix3D benchmarks indicate that the proposed Pix2Vox outperforms state-of-the-arts by a large margin. Furthermore, the proposed method is 24 times faster than 3D-R2N2 in terms of backward inference time. The experiments on ShapeNet unseen 3D categories have shown the superior generalization abilities of our method.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1901.11153/full.md

## Figures

11 figures with captions in the complete paper: https://tomesphere.com/paper/1901.11153/full.md

## References

36 references — full list in the complete paper: https://tomesphere.com/paper/1901.11153/full.md

---
Source: https://tomesphere.com/paper/1901.11153