Direct Multi-view Multi-person 3D Pose Estimation
Tao Wang, Jianfeng Zhang, Yujun Cai, Shuicheng Yan, Jiashi Feng

TL;DR
The paper introduces MvP, a multi-view transformer-based method that directly regresses multi-person 3D poses from images, and reports higher accuracy and efficiency than previous approaches on key benchmarks.
Contribution
The paper proposes a transformer-based framework that combines hierarchical query embeddings, projective attention, and RayConv to regress 3D poses directly, without relying on intermediate tasks.
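To make the idea of hierarchical query embeddings concrete, here is a minimal sketch. It assumes the hierarchy is realized as the element-wise sum of one embedding per person and one embedding per joint type; that composition rule, the function names, and the sizes below are illustrative assumptions, not details confirmed by this summary.

```python
import random


def make_embeddings(count, dim, seed):
    """Create `count` random vectors of length `dim` (stand-ins for learned weights)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


def compose_queries(person_emb, joint_emb):
    """Return queries[n][j] = person_emb[n] + joint_emb[j], summed element-wise.

    Assumption: the hierarchy is a simple sum of a person-level and a
    joint-level embedding; the paper's exact formulation may differ.
    """
    return [
        [[p + q for p, q in zip(person, joint)] for joint in joint_emb]
        for person in person_emb
    ]


# Hypothetical sizes chosen only for illustration.
NUM_PERSONS, NUM_JOINTS, DIM = 10, 15, 8

person_emb = make_embeddings(NUM_PERSONS, DIM, seed=0)
joint_emb = make_embeddings(NUM_JOINTS, DIM, seed=1)
queries = compose_queries(person_emb, joint_emb)

# Parameter count: one vector per (person, joint) pair vs. the shared scheme.
flat_params = NUM_PERSONS * NUM_JOINTS * DIM
hier_params = (NUM_PERSONS + NUM_JOINTS) * DIM
```

Under these assumptions the shared scheme stores (N + J) vectors instead of N × J, which is one way such a representation can be called concise.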
Findings
MvP achieves 92.3% AP25 on the Panoptic dataset, surpassing previous methods.
It is more accurate than prior state-of-the-art methods while also being more efficient.
It can be extended to human mesh recovery using the SMPL model.
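As a rough guide to reading the AP25 figure, the sketch below assumes the common convention in multi-person 3D pose benchmarks: a predicted pose counts as correct when its mean per-joint position error (MPJPE) against a matched ground-truth pose is below a threshold in millimeters, here 25 mm. The summary does not define the metric, so both the convention and the helper names are assumptions.

```python
import math


def mpjpe(pred_joints, gt_joints):
    """Mean Euclidean distance between corresponding joints (same units as input)."""
    distances = [
        math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)))
        for pred, gt in zip(pred_joints, gt_joints)
    ]
    return sum(distances) / len(distances)


def is_correct(pred_joints, gt_joints, threshold_mm=25.0):
    """Assumed AP25-style criterion: pose is a true positive if MPJPE < threshold."""
    return mpjpe(pred_joints, gt_joints) < threshold_mm


# Toy two-joint skeleton in millimeters (made-up coordinates).
gt = [(0.0, 0.0, 1000.0), (100.0, 0.0, 1000.0)]
close_pred = [(3.0, 4.0, 1000.0), (103.0, 4.0, 1000.0)]    # each joint 5 mm off
far_pred = [(30.0, 40.0, 1000.0), (130.0, 40.0, 1000.0)]   # each joint 50 mm off

close_error = mpjpe(close_pred, gt)
far_error = mpjpe(far_pred, gt)
```

Average precision is then computed over confidence-ranked predictions using this per-pose criterion; that ranking step is omitted here.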
Abstract
We present Multi-view Pose transformer (MvP) for estimating multi-person 3D poses from multi-view images. Instead of estimating 3D joint locations from costly volumetric representation or reconstructing the per-person 3D pose from multiple detected 2D poses as in previous methods, MvP directly regresses the multi-person 3D poses in a clean and efficient way, without relying on intermediate tasks. Specifically, MvP represents skeleton joints as learnable query embeddings and lets them progressively attend to and reason over the multi-view information from the input images to directly regress the actual 3D joint locations. To improve the accuracy of such a simple pipeline, MvP presents a hierarchical scheme to concisely represent query embeddings of multi-person skeleton joints and introduces an input-dependent query adaptation approach. Further, MvP designs a novel geometrically guided…
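Relating a 3D joint estimate to multi-view images requires knowing where that point lands in each camera. The sketch below shows only the standard pinhole projection used for this purpose; it is not the paper's attention mechanism, whose description is truncated above. The camera parameters and the `project_point` helper are hypothetical.

```python
def project_point(point_3d, rotation, translation, focal, principal):
    """Project a 3D world point to pixel coordinates with a pinhole camera.

    rotation: 3x3 nested list (world to camera), translation: length-3 list,
    focal: (fx, fy), principal: (cx, cy). Returns None if the point lies
    behind the camera.
    """
    # World to camera frame: X_cam = R @ X + t
    cam = [
        sum(r * x for r, x in zip(row, point_3d)) + t
        for row, t in zip(rotation, translation)
    ]
    if cam[2] <= 0:
        return None
    # Perspective division followed by the intrinsic scaling and offset.
    u = focal[0] * cam[0] / cam[2] + principal[0]
    v = focal[1] * cam[1] / cam[2] + principal[1]
    return (u, v)


# Two hypothetical cameras sharing intrinsics; the second is shifted along x.
IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cameras = [
    {"rotation": IDENTITY, "translation": [0.0, 0.0, 0.0]},
    {"rotation": IDENTITY, "translation": [1.0, 0.0, 0.0]},
]
joint = (0.5, -0.25, 5.0)  # made-up 3D joint location in meters

pixels = [
    project_point(joint, cam["rotation"], cam["translation"],
                  focal=(1000.0, 1000.0), principal=(960.0, 540.0))
    for cam in cameras
]
```

Each entry of `pixels` is the location at which image features for that view could be sampled for the given 3D point.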
Taxonomy
Topics: Human Pose and Action Recognition · Video Surveillance and Tracking Methods · Diabetic Foot Ulcer Assessment and Management
