# Semantic Estimation of 3D Body Shape and Pose using Minimal Cameras

**Authors:** Andrew Gilbert, Matthew Trumble, Adrian Hilton, John Collomosse

arXiv: 1908.03030 · 2020-09-08

## TL;DR

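This paper estimates 3D human body shape (as volumetric occupancy) and skeletal pose from multi-view video with as few as two views, using a multi-channel symmetric 3D convolutional encoder-decoder. It reports improved reconstruction accuracy and lower pose estimation error than prior work on Human 3.6M and TotalCapture.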

## Contribution

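It introduces a multi-channel symmetric 3D convolutional encoder-decoder trained with a dual loss, so that a single latent embedding supports both skeletal joint estimation and volumetric reconstruction from as few as two views. Inference is regularized by a prior learned from view-ablated multi-view video, which helps the model generalize to unseen subjects and actions.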
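To make the idea of a dual loss concrete, here is a minimal sketch in plain Python. It is not the authors' formulation: this summary does not give the exact loss terms, so the use of mean squared error for both terms, the function names, and the `joint_weight` parameter are all assumptions made for illustration.

```python
from typing import Sequence


def mean_squared_error(pred: Sequence[float], target: Sequence[float]) -> float:
    """Average squared difference between two equal-length flat sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    if len(pred) == 0:
        raise ValueError("inputs must be non-empty")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def dual_loss(
    pred_joints: Sequence[float],
    true_joints: Sequence[float],
    pred_volume: Sequence[float],
    true_volume: Sequence[float],
    joint_weight: float = 1.0,  # hypothetical balancing weight, not from the paper
) -> float:
    """Combine a skeletal term and a volumetric term into one scalar.

    Joints are flattened (x, y, z) coordinates; volumes are flattened
    voxel occupancies in [0, 1]. Both terms use MSE here by assumption.
    """
    skeleton_term = mean_squared_error(pred_joints, true_joints)
    volume_term = mean_squared_error(pred_volume, true_volume)
    return volume_term + joint_weight * skeleton_term


# Usage: one joint (3 coordinates) and a 4-voxel volume.
loss = dual_loss(
    pred_joints=[0.0, 0.0, 0.0],
    true_joints=[0.0, 0.0, 0.0],
    pred_volume=[1.0, 0.0, 0.0, 0.0],
    true_volume=[0.0, 0.0, 0.0, 0.0],
)
```

Because both terms are driven by the same latent embedding, minimizing their sum pushes the encoder toward a representation that is useful for pose and shape at once. How the two terms are actually balanced in the paper is left to the full text.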

## Key findings

- Improved volumetric reconstruction accuracy relative to prior work on Human 3.6M and TotalCapture.
- Lower pose estimation error relative to prior work on the same two datasets.
- The learned prior generalizes well to unseen subjects and actions.

## Abstract

We aim to simultaneously estimate the 3D articulated pose and high-fidelity volumetric occupancy of human performance, from multiple viewpoint video (MVV) with as few as two views. We use a multi-channel symmetric 3D convolutional encoder-decoder with a dual loss to enforce the learning of a latent embedding that enables inference of skeletal joint positions and a volumetric reconstruction of the performance. The inference is regularised via a prior learned over a dataset of view-ablated multi-view video footage of a wide range of subjects and actions, and we show this to generalise well across unseen subjects and actions. We demonstrate improved reconstruction accuracy and lower pose estimation error relative to prior work on two MVV performance capture datasets: Human 3.6M and TotalCapture.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1908.03030/full.md

## Figures

5 figures with captions in the complete paper: https://tomesphere.com/paper/1908.03030/full.md

## References

47 references — full list in the complete paper: https://tomesphere.com/paper/1908.03030/full.md

---
Source: https://tomesphere.com/paper/1908.03030