TL;DR
This paper introduces a novel approach combining top-down and bottom-up methods for accurate 3D multi-person pose estimation from monocular video, addressing challenges like occlusion, scale variation, and detection errors.
Contribution
It proposes an integrated framework that leverages the strengths of both approaches. Its components are a top-down network for estimating human joints, a bottom-up network with normalized heatmaps, and a test-time optimization step for improved accuracy.
Findings
Effective handling of occlusion and scale variations.
Robustness to detection errors in multi-person scenes.
Improved 3D pose accuracy demonstrated in experiments.
Abstract
Monocular 3D human pose estimation has made progress in recent years. Most methods focus on a single person and estimate poses in person-centric coordinates, i.e., coordinates based on the center of the target person. Hence, these methods are inapplicable to multi-person 3D pose estimation, where absolute coordinates (e.g., camera coordinates) are required. Moreover, multi-person pose estimation is more challenging than single-person pose estimation, due to inter-person occlusion and close human interactions. Existing top-down methods rely on human detection, and thus suffer from detection errors and cannot produce reliable pose estimates in multi-person scenes. Meanwhile, existing bottom-up methods that do not use human detection are not affected by detection errors, but since they process all persons in a scene at…
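Example
The distinction the abstract draws between person-centric and absolute (camera) coordinates can be illustrated with a minimal sketch. It is not the paper's method: the function names, the choice of joint 0 as the root, and the sample values are all hypothetical, and it only shows why a root-relative pose alone cannot place a person in a shared scene.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]


def to_camera_coords(root_relative: List[Point3D], root_in_camera: Point3D) -> List[Point3D]:
    """Shift a root-relative (person-centric) pose by the root's camera-space position."""
    rx, ry, rz = root_in_camera
    return [(x + rx, y + ry, z + rz) for x, y, z in root_relative]


def to_root_relative(camera_pose: List[Point3D], root_index: int = 0) -> List[Point3D]:
    """Subtract the root joint so the pose is expressed relative to the person.

    Assumes the root joint (e.g. the pelvis) sits at `root_index`; this is an
    illustrative convention, not one stated in the source.
    """
    rx, ry, rz = camera_pose[root_index]
    return [(x - rx, y - ry, z - rz) for x, y, z in camera_pose]


# Two people with the same root-relative pose (root plus one joint 0.5 m away).
pose = [(0.0, 0.0, 0.0), (0.0, -0.5, 0.0)]

# Hypothetical root positions in camera coordinates, in metres.
person_a = to_camera_coords(pose, (1.0, 0.0, 3.0))
person_b = to_camera_coords(pose, (-1.0, 0.0, 5.0))

# In person-centric coordinates the two people are indistinguishable;
# in camera coordinates they occupy different places in the scene.
same_when_person_centric = to_root_relative(person_a) == to_root_relative(person_b)
differ_in_camera_space = person_a != person_b
```

A single-person estimator only has to produce the root-relative list. A multi-person estimator also has to recover each person's root position in camera space, which is why person-centric methods do not carry over directly.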
