Deep Two-Stream Video Inference for Human Body Pose and Shape Estimation
Ziwen Li, Bo Xu, Han Huang, Cheng Lu, Yandong Guo

TL;DR
This paper introduces DTS-VIBE, a novel two-stream transformer-based framework that fuses RGB and optical flow to improve the stability and accuracy of 3D human pose and shape estimation from videos.
Contribution
The paper reformulates video-based 3D human reconstruction as a multi-modality problem, fusing RGB and optical flow in a transformer-based two-stream temporal network that predicts SMPL parameters.
Findings
Outperforms state-of-the-art methods on the Human3.6M and 3DPW datasets.
Improves temporal consistency and accuracy in 3D pose estimation.
Utilizes optical flow to leverage motion information between frames.
Abstract
Several video-based 3D pose and shape estimation algorithms have been proposed to resolve the temporal inconsistency of single-image-based methods. However, stable and accurate reconstruction remains challenging. In this paper, we propose a new framework, Deep Two-Stream Video Inference for Human Body Pose and Shape Estimation (DTS-VIBE), to generate 3D human pose and mesh from RGB videos. We reformulate the task as a multi-modality problem that fuses RGB and optical flow for more reliable estimation. To fully utilize both sensory modalities (RGB and optical flow), we train a transformer-based two-stream temporal network to predict SMPL parameters. The supplementary modality, optical flow, helps maintain temporal consistency by leveraging motion knowledge between two consecutive frames. The proposed algorithm is extensively evaluated on the Human3.6M and 3DPW…
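The two-stream idea described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' architecture: the fusion by concatenation, the single attention head, the random weights, and all names (`two_stream_predict`, `init_params`, `temporal_self_attention`) are assumptions made for illustration. It also assumes that per-frame RGB and optical-flow features have already been extracted and aligned to the same number of frames. The only fixed facts used are the SMPL parameter sizes: 72 pose values (24 joints in axis-angle form) and 10 shape coefficients.

```python
import numpy as np

SMPL_POSE_DIM = 72   # 24 joints x 3 axis-angle values
SMPL_SHAPE_DIM = 10  # shape coefficients (betas)


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def temporal_self_attention(x, w_q, w_k, w_v):
    """Single-head scaled dot-product attention over the time axis.

    x has shape (T, D): one feature vector per frame.
    """
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / np.sqrt(k.shape[-1])  # (T, T) frame-to-frame weights
    return softmax(scores, axis=-1) @ v


def init_params(d_in, d_model, rng):
    """Random weights for the sketch (hypothetical; a real model is trained)."""
    scale = 1.0 / np.sqrt(d_in)
    out_dim = SMPL_POSE_DIM + SMPL_SHAPE_DIM
    return {
        "w_q": rng.normal(scale=scale, size=(d_in, d_model)),
        "w_k": rng.normal(scale=scale, size=(d_in, d_model)),
        "w_v": rng.normal(scale=scale, size=(d_in, d_model)),
        "w_out": rng.normal(scale=1.0 / np.sqrt(d_model), size=(d_model, out_dim)),
    }


def two_stream_predict(rgb_feats, flow_feats, params):
    """Fuse per-frame RGB and flow features, then regress SMPL parameters.

    rgb_feats: (T, D_rgb), flow_feats: (T, D_flow).
    Returns (pose, shape) with shapes (T, 72) and (T, 10).
    """
    # Assumed fusion: concatenate the two streams frame by frame.
    fused = np.concatenate([rgb_feats, flow_feats], axis=-1)
    attended = temporal_self_attention(
        fused, params["w_q"], params["w_k"], params["w_v"]
    )
    out = attended @ params["w_out"]
    return out[:, :SMPL_POSE_DIM], out[:, SMPL_POSE_DIM:]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    rgb = rng.normal(size=(16, 32))   # stand-in RGB features for 16 frames
    flow = rng.normal(size=(16, 32))  # stand-in optical-flow features
    pose, shape = two_stream_predict(rgb, flow, init_params(64, 48, rng))
    print(pose.shape, shape.shape)  # (16, 72) (16, 10)
```

Because the attention weights mix information across all frames, each frame's prediction depends on its neighbours, which is one plausible way a temporal network could smooth out per-frame jitter. The actual fusion and network design in DTS-VIBE may differ.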
Taxonomy
Topics: Human Pose and Action Recognition · Advanced Vision and Imaging · Hand Gesture Recognition Systems
