# Deep Multitask Architecture for Integrated 2D and 3D Human Sensing

**Authors:** Alin-Ionut Popa, Mihai Zanfir, Cristian Sminchisescu

arXiv: 1701.08985 · 2017-02-01

## TL;DR

This paper introduces a deep multitask neural network architecture that simultaneously performs 2D and 3D human pose recognition and reconstruction from monocular images, leveraging multi-dataset training for improved accuracy.

## Contribution

The novel architecture enables joint training of recognition and reconstruction tasks using diverse datasets, improving robustness and accuracy in 2D and 3D human sensing from monocular images.

## Key findings

- Achieves state-of-the-art results on multiple datasets.
- Performs comparably to RGB-D systems in real-world scenarios.
- Supports joint training across datasets with different annotations.

## Abstract

We propose a deep multitask architecture for \emph{fully automatic 2d and 3d human sensing} (DMHS), including \emph{recognition and reconstruction}, in \emph{monocular images}. The system computes the figure-ground segmentation, semantically identifies the human body parts at pixel level, and estimates the 2d and 3d pose of the person. The model supports the joint training of all components by means of multi-task losses where early processing stages recursively feed into advanced ones for increasingly complex calculations, accuracy and robustness. The design allows us to tie a complete training protocol, by taking advantage of multiple datasets that would otherwise restrictively cover only some of the model components: complex 2d image data with no body part labeling and without associated 3d ground truth, or complex 3d data with limited 2d background variability. In detailed experiments based on several challenging 2d and 3d datasets (LSP, HumanEva, Human3.6M), we evaluate the sub-structures of the model, the effect of various types of training data in the multitask loss, and demonstrate that state-of-the-art results can be achieved at all processing levels. We also show that in the wild our monocular RGB architecture is perceptually competitive to a state-of-the art (commercial) Kinect system based on RGB-D data.

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1701.08985/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/1701.08985/full.md

## References

49 references — full list in the complete paper: https://tomesphere.com/paper/1701.08985/full.md

---
Source: https://tomesphere.com/paper/1701.08985