PHD: Personalized 3D Human Body Fitting with Point Diffusion
Hsuan-I Ho, Chen Guo, Po-Chen Wu, Ivan Shugurov, Chengcheng Tang, Abhay Mittal, Sizhe An, Manuel Kaufmann, Linguang Zhang

TL;DR
PHD introduces a personalized 3D human body fitting method that uses a shape-conditioned point diffusion model to improve pose accuracy from videos, requiring only synthetic data and integrating easily with existing systems.
Contribution
The paper presents a novel shape-conditioned 3D pose prior using a Point Diffusion Transformer, enhancing pose estimation accuracy by incorporating user-specific body shape information.
Findings
Improves pelvis-aligned and absolute pose accuracy.
Highly data-efficient, trained solely on synthetic data.
Can be integrated with existing 3D pose estimators.
| Method | MPJPE | MPJPE-PA | MVE | MVE-PA |
|---|---|---|---|---|
| ScoreHMR [67] Sample init. | 114.0 | 82.3 | 141.3 | 101.9 |
| PHD (Ours) Sample init. | 73.6 | 49.2 | 86.4 | 59.1 |
| HMR2.0b [15] init. | 117.2 | 77.9 | 140.2 | 93.9 |
| w/ SMPLify [3] | - | 83.5(+5.6) | - | - |
| w/ ScoreHMR [67] | - | 76.5(-1.4) | - | - |
| w/ ScoreHMR* [67] | 105.5(-11.7) | 70.0(-7.9) | 124.5(-15.7) | 84.7(-9.2) |
| w/ PHD (Ours) | 73.2(-44.0) | 47.4(-30.5) | 86.4(-45.8) | 58.5(-35.4) |
| CameraHMR [54] init. | 70.3 | 43.3 | 81.7 | - |
| w/ ScoreHMR* [67] | 74.9(+4.6) | 45.0(+1.7) | 89.0(+7.3) | 54.5 |
| w/ PHD (Ours) | 62.5(-7.8) | 42.4(-0.9) | 74.6(-7.1) | 51.6 |

| Method | Pelvis Err. (mm) | C-MPJPE (mm) |
|---|---|---|
| ScoreHMR [67] Sample init. | 154.3 | 154.4 |
| PHD (Ours) Sample init. | 91.5 | 115.8 |
| HMR2.0b [15] init. | 144.0 | 182.0 |
| w/ ScoreHMR* [67] | 180.6 (+36.6) | 181.4 (-0.6) |
| w/ PHD (Ours) | 94.7 (-49.3) | 112.6 (-69.4) |
| CameraHMR [54] init. | 163.0 | 160.3 |
| w/ ScoreHMR* [67] | 154.3(-5.7) | 154.4(-5.9) |
| w/ PHD (Ours) | 130.9 (-32.1) | 135.6 (-27.4) |

| Method | Joint Error Mean (mm) | Joint Error Max (mm) | Vertex Error Mean (mm) | Vertex Error Max (mm) |
|---|---|---|---|---|
| Zero Shape | 28.41 | 52.72 | 29.84 | 60.18 |
| ScoreHMR [67] Mean | 29.07 | 55.37 | 29.32 | 60.23 |
| CameraHMR [54] | 30.60 | 57.28 | 31.85 | 62.09 |
| TokenHMR [12] | 25.15 | 53.36 | 27.46 | 50.49 |
| SHAPY [7] | 22.94 | 41.02 | 21.38 | 44.82 |
| NLF [61] | 19.36 | 36.37 | 20.61 | 41.46 |
| SHAPify (Ours) w/o M. | 13.97 | 29.30 | 14.25 | 33.44 |
| SHAPify (Ours) | 11.29 | 20.97 | 9.18 | 21.91 |

| Method | Shape Err. J/V | MPJPE | MVE | C-MPJPE | Pelvis Err. |
|---|---|---|---|---|---|
| Mean Shape | 29.1 / 29.3 | 81.0 | 94.6 | 177.3 | 163.6 |
| Zero Shape | 28.4 / 29.8 | 79.8 | 93.2 | 170.8 | 156.8 |
| SHAPify w/ M. | 11.3 / 9.2 | 73.6 | 86.4 | 115.8 | 91.5 |
| SHAPify w/o M. | 13.9 / 14.2 | 73.2 | 86.0 | 116.3 | 95.0 |
| GT Shape | - | 72.5 | 85.2 | 110.4 | 84.7 |
| w/o Point Distillation | - | 75.8 | 91.7 | 111.4 | 83.3 |
| w/ Point Distillation | - | 72.5 | 85.2 | 110.4 | 84.7 |

| Method | | | | |
|---|---|---|---|---|
| IK (24 joints) [81] | 67.8 | 49.3 | 81.9 | 60.3 |
| 25% Points | 63.8 | 45.3 | 78.1 | 59.0 |
| 50% Points | 63.2 | 44.7 | 73.7 | 53.6 |
| 100% Points | 62.6 | 44.4 | 72.9 | 53.1 |

| Method | MPJPE | MPJPE-PA | MVE | MVE-PA |
|---|---|---|---|---|
| ScoreHMR [67] Sample init. | 94.3 | 66.6 | 122.3 | 94.7 |
| PHD (Ours) Sample init. | 84.2 | 51.0 | 100.4 | 67.6 |
| HMR2.0b [15] init. | 81.7 | 54.2 | 93.5 | 67.7 |
| w/ SMPLify [3] | - | 60.1(+5.9) | - | - |
| w/ ScoreHMR [67] | - | 51.1(-3.1) | - | - |
| w/ ScoreHMR* [67] | 75.7(-6.0) | 51.2(-3.0) | 87.5(-6.0) | 65.4(-2.3) |
| w/ PHD (Ours) | 80.5(-1.2) | 44.4(-9.8) | 93.3(-0.2) | 57.3(-10.4) |
| CameraHMR [54] init. | 62.1 | 38.5 | 72.9 | - |
| w/ ScoreHMR* [67] | 59.6(-2.5) | 38.0(-0.5) | 71.5(-1.4) | 51.1 |
| w/ PHD (Ours) | 59.4(-2.7) | 37.5(-1.0) | 71.3(-1.6) | 50.9 |

| Method | EMDB [31] MPJPE | EMDB [31] MPJPE-PA | EMDB [31] MVE | EMDB [31] MVE-PA | 3DPW [73] MPJPE | 3DPW [73] MPJPE-PA | 3DPW [73] MVE | 3DPW [73] MVE-PA |
|---|---|---|---|---|---|---|---|---|
| HMR2.0b [15] | 117.4 | 78.0 | 140.5 | 94.0 | 81.8 | 54.4 | 93.5 | 67.8 |
| PARE [34] | 113.9 | 72.2 | 133.2 | 85.4 | 74.5 | 46.5 | 88.6 | - |
| HMR2.0a [15] | 98.3 | 60.7 | 120.8 | - | 69.8 | 44.4 | 82.2 | - |
| TokenHMR [12] | 88.1 | 49.8 | 104.2 | - | 70.5 | 43.8 | 86.0 | - |
| CameraHMR [54] | 70.3 | 43.3 | 81.7 | - | 62.1 | 38.5 | 72.9 | - |
| NLF [61] | 68.4 | 40.9 | 80.6 | 51.1 | 59.0 | 36.5 | 69.7 | 48.8 |
| LGD [66] | 115.8 | 81.1 | 140.6 | 95.7 | - | 59.8 | - | - |
| ReFit [74] | 88.0 | 58.6 | 104.5 | - | 65.3 | 40.5 | 75.1 | - |
| WHAM* [65] | 79.7 | 50.4 | 94.4 | - | 57.8* | 35.9* | 68.7* | - |
| ScoreHMR [67] (CameraHMR init.) | 74.9 | 45.0 | 89.0 | 54.5 | 59.6 | 38.0 | 71.5 | 51.1 |
| Ours (Sample init.) | 73.6 | 49.2 | 86.4 | 59.1 | 84.2 | 51.0 | 100.4 | 67.6 |
| Ours (HMR2.0b init.) | 73.2 | 47.4 | 86.4 | 58.5 | 80.5 | 44.4 | 93.3 | 57.3 |
| Ours (CameraHMR init.) | 62.5 | 42.4 | 74.6 | 51.6 | 59.4 | 37.5 | 71.3 | 50.9 |
PHD: Personalized 3D Human Body Fitting with Point Diffusion
Hsuan-I Ho1,3
Chen Guo1,3
Po-Chen Wu3
Ivan Shugurov3
Chengcheng Tang3
Abhay Mittal3
Sizhe An3
Manuel Kaufmann
Linguang Zhang
1Department of Computer Science, ETH Zürich
2ETH AI Center, ETH Zürich
3Reality Labs, Meta
Abstract
We introduce PHD, a novel approach for personalized 3D human mesh recovery (HMR) and body fitting that leverages user-specific shape information to improve pose estimation accuracy from videos. Traditional HMR methods are designed to be user-agnostic and optimized for generalization. While these methods often refine poses using constraints derived from the 2D image to improve alignment, this process compromises 3D accuracy by failing to jointly account for person-specific body shapes and the plausibility of 3D poses. In contrast, our pipeline decouples this process by first calibrating the user’s body shape and then employing a personalized pose fitting process conditioned on that shape. To achieve this, we develop a body shape-conditioned 3D pose prior, implemented as a Point Diffusion Transformer, which iteratively guides the pose fitting via a Point Distillation Sampling loss. This learned 3D pose prior effectively mitigates errors arising from an over-reliance on 2D constraints. Consequently, our approach improves not only pelvis-aligned pose accuracy but also absolute pose accuracy – an important metric often overlooked by prior work. Furthermore, our method is highly data-efficient, requiring only synthetic data for training, and serves as a versatile plug-and-play module that can be seamlessly integrated with existing 3D pose estimators to enhance their performance. Code and models are available at https://PHD-Pose.github.io.
† Equal advisory
1 Introduction
Perceiving accurate 3D human pose and shape is fundamental for future AI systems, such as realistic human avatars for AR/VR telepresence, personalized service robots, and human behavior understanding. Impressive strides have recently been made in 3D human pose and shape estimation [3, 28, 15, 67, 54, 61, 36, 33]. Most of these methods are designed to be subject-agnostic, focusing on broad generalization, and are trained on large corpora of 3D pseudo-labels relying on 2D reprojection objectives. We identify two main problems in existing work.
First, they miss a key opportunity to enhance 3D pose accuracy by leveraging the consistency of user identity over time. Typically, these methods estimate body shape, pose, and pelvis position (in camera coordinates) simultaneously per frame. However, the shape parameters they predict often vary across frames, contradicting the assumption that a subject’s body shape does not change in a short video. This entanglement of pose and shape can lead to undesirable situations where changes in pose must compensate for shape errors to satisfy the optimization objective. Second, existing methods tend to rely too heavily on 2D objectives to achieve pose-to-image alignment, often at the expense of 3D accuracy. This problem affects not only optimization-based methods but also regressors trained on 3D pseudo-labels that were obtained by fitting initial 3D pose estimates with 2D objectives.
As a result of these two problems, existing body fitting pipelines often fail to simultaneously ensure the accuracy of body shape, body pose, and pelvis position (see Fig. 2). In this work, we address these limitations through an innovative personalized 3D body pose estimation method, as opposed to existing generalized methods.
We demonstrate that explicitly considering user-specific information is essential for achieving consistently accurate body pose estimation over time. Our approach decomposes the body pose estimation problem into two stages: a shape estimation stage, referred to as personalization, and a pose fitting stage that uses the shape as a conditional input. To perform personalization, we introduce SHAPify, a body shape fitting method that calibrates a user’s shape parameters from a single video frame showing a reference pose and optionally uses as little as the height and weight of the person as additional identity information. SHAPify provides highly accurate shape estimations compared to existing generalized pose and shape estimators.
The subsequent pose fitting stage is inspired by existing methods [3, 36, 66, 38, 67] that follow a “regress-then-refine” scheme. These methods begin by jointly estimating shape, pose, and pelvis position using a regressor, followed by an optimization process that aims to minimize 2D alignment errors using 2D keypoint constraints. Although widely used in pose refinement [3, 36, 66, 38, 67] and avatar reconstruction [24, 17, 51, 18], this approach is limited. A strong reliance on 2D objectives is problematic as they may easily degrade 3D pose accuracy due to the neglect of depth information (e.g., see Fig. 6). Furthermore, existing methods are often at the mercy of the provided initializations, i.e., they often cannot recover from a bad initialization. We argue that a good 3D prior should be able to do so, i.e., improve the bad parts of the initialization while retaining the good ones.
These limitations motivate us to incorporate a novel, strong 3D pose prior, called PointDiT, into the “regress-then-refine” fitting process. The 3D pose prior encourages the pose output to remain within the manifold of plausible poses, preventing overfitting to 2D observations. At the same time, it is less reliant on good initializations than previous work and can correct bad initializations. More specifically, we formulate the prior as a body shape-conditioned point diffusion model that learns to sample body surface points. We demonstrate that using surface points as the pose representation reconstructs uncommon poses more effectively than joint angle-based counterparts [67, 5]. PointDiT is then iteratively applied to guide the pose fitting process via a newly designed Point Distillation Sampling loss. By integrating the 3D pose prior into the fitting loop, we effectively constrain the plausibility of 3D poses when optimizing with 2D alignment objectives. It is worth noting that PointDiT is efficiently trained only with the synthetic dataset BEDLAM [2], and can be easily applied to off-the-shelf 3D pose estimators as a versatile plug-and-play module. In our experiments, we demonstrate further accuracy improvements over our closest related work ScoreHMR [67], and establish a new state of the art on EMDB [31] for both pelvis-aligned and absolute pose accuracy – a metric often underemphasized in prior work. In summary, our contributions are as follows:
- A new personalized body pose estimation paradigm, PHD, that utilizes user-specific shape information to improve pose accuracy by decoupling the estimation of body shape, pose, and pelvis position.
- A robust body shape-conditioned 3D body pose prior based on a point diffusion model, PointDiT, which we use with a Point Distillation Sampling loss to provide effective 3D pose guidance in personalized body fitting.
- We demonstrate that PHD is versatile, accurate, and improves pose initializations supplied by strong baselines.
2 Related Work
3D Human pose estimation and body fitting.
Estimating 3D human pose and shape from monocular inputs has been extensively studied, ranging from optimization-based approaches to recent transformer-based regressors. Early optimization-based methods [59, 19, 76] performed 3D pose tracking by fitting pre-scanned body templates to video sequences. However, obtaining such templates demands specialized capturing equipment and is not easy to scale up. Therefore, later work resorts to fitting a parametric body model [45, 55, 75, 26] to 2D observations, such as landmarks [3, 55, 75], masks [52], and body part segmentations [39]. While optimization-based methods achieve promising fitting results, they often involve lengthy optimization loops and are prone to overfitting to 2D objectives. Learning-based approaches, instead, directly regress the parametric body model from images [28, 36, 34, 40, 10, 69, 70, 4, 42, 68, 61] or videos [33, 29, 71, 65], or estimate human bodies as mesh vertices [43, 37, 62, 6, 8]. Recent advancements successfully employed transformer-based architectures [9, 77] for learning-based 3D pose estimation and achieved state-of-the-art performance [15, 12, 60, 54, 4, 68]. To further enhance the capability of transformer models, 3D pseudo ground-truth annotations [27, 50, 15, 41] are widely used for training. While this ensures better 2D pose-to-image alignment, the 3D plausibility of the resulting poses is reduced.
Recently, a new line of work [66, 74, 67, 48] combines both approaches by first estimating an initial human body and then refining it. The recent ScoreHMR [67] is our closest related work. Note that these methods are all generalized models designed to be subject-agnostic. In contrast to these generalized methods, our goal is to leverage only minimal personal information [48] (i.e., height and weight) to achieve more accurate and personalized pose and shape estimation.
3D human pose prior.
3D pose priors are essential for numerous tasks, e.g., fitting 3D human poses to images and videos [58, 47, 3, 55, 72], modeling pose ambiguity [63, 11], and inpainting body poses [30, 21]. Existing pose priors can be broadly categorized into two types: unconditional and conditional ones. For instance, early unconditional pose priors focused on learning joint limits [1] to avoid implausible poses. More recently, Gaussian Mixture Models [3], Generative Adversarial Networks [14, 28], VAEs [55, 12], and implicit neural functions [72, 20] have been employed as unconditional pose priors for model training and fitting. However, many of these unconditional priors are biased towards common training samples and do not generalize well to unseen poses. Of late, the community has embraced conditional prior models, e.g., ProHMR [38], GenHMR [60], and ScoreHMR [67], where the prior is trained to condition on an input image. Trained on more diverse visual data, these methods demonstrated better generalization to unseen images and poses.
Building on the foundations of previous conditional prior models, we develop a generative model that captures the conditional distribution of body poses given an input image and the human body shape. Previously, ScoreHMR and Cho et al. [5] used angular parameters to represent body poses during training. Our experiments (Sec. 4.4) suggest that this approach is suboptimal for image and shape conditioning, due to the weak correlation between angular parameters and conditional inputs. Inspired by related work that has recognized the benefits of denser representations [16, 80, 61], we instead build our prior with 3D points sampled from the body surface. This is a natural choice given that we condition on a 3D body shape, to which 3D body points are more closely related than angular parameters are. We integrate this novel point-based pose prior into the pose fitting process for robust personalized 3D pose estimation.
3 Methodology
Preliminaries.
We use the SMPL [45] parametric body model to represent 3D body shape and pose. The body shape is parameterized by PCA coefficients, which SMPL maps to a person-specific mesh in the reference T-pose. The body pose is composed of 23 relative joint rotations in the axis-angle representation, along with the pelvis orientation in the camera coordinate system. Finally, we define the pelvis position as the position of the SMPL pelvis relative to the camera center (origin).
Method overview.
Our goal is to estimate the shape parameters, bone rotations, per-frame pelvis orientation, and pelvis location from a monocular RGB video of a moving human, given the camera’s focal length. Our personalized body fitting pipeline comprises two main steps, as illustrated in Fig. 3. First, we extract person-specific shape information by estimating the shape parameters from a calibration frame where the person is approximately in a rest pose (e.g., T-pose or I-pose; Sec. 3.1). This step is performed only once for each subject. In the second body fitting step, a point diffusion model (Sec. 3.2) that learns to sample body points conditioned on the image and shape parameters is trained to serve as a 3D prior for optimizing the body pose. We then iteratively sample body points and refine the pose parameters by considering both 2D keypoint projections and 3D prior guidance (Sec. 3.3).
3.1 Personalized Body Shape Estimation
Obtaining accurate body shape from a single image is an ill-posed problem due to occlusions by clothing and scale/pose ambiguity, which poses challenges for existing data-driven methods. Inspired by SHAPY [7] and SMPLify [3], we develop an optimization-based method, SHAPify, to reliably obtain shape parameters from a single calibration image, 2D keypoints obtained from an off-the-shelf detector, and optional body measurements (see Fig. 3 (a)).
Shape calibration setup.
To estimate an accurate body shape from a single image, we capture the subject in a rest pose. In the standard 3D pose and shape estimation problem, the shape parameters, bone rotations, pelvis orientation, and pelvis position are all unknown variables. We initialize the pose from the predefined rest pose and, by adjusting the learning rates, prioritize changes in the shape parameters and pelvis orientation over changes in the bone rotations and pelvis position during optimization.
Optimization objectives.
We minimize the following objective:
[TABLE]
where the SMPL differentiable LBS function yields the 3D joints, which are mapped to the image by the 3D-to-2D projection and compared against the 2D keypoints obtained from a keypoint detector. We choose smaller learning rates for the bone rotations and pelvis position to encourage changes in the shape parameters and pelvis orientation.
2D keypoints alone do not provide sufficient constraints on the body shape, as the keypoint projection depends on both the pelvis’ pitch angle and the body shape. To mitigate this ambiguity, we impose additional regularization terms to ensure the human shape follows specific body measurements:
[TABLE]
Here, the target values are the average human height and weight, which can be replaced by user-provided measurements if available, and the height and weight of the fitted body are computed from the SMPL body mesh by differentiable functions (see Supp. Mat. Sec. 6.1). The final objective then is:
[TABLE]
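The interplay between the reprojection term and the measurement terms can be illustrated with a small numerical sketch. This is not the SHAPify implementation: the pinhole projection, the squared-error penalties, the toy joints, and all names (`project`, `keypoint_loss`, `measurement_loss`) are assumptions made for illustration. The sketch shows that enlarging a body and moving it farther from the camera by the same factor leaves the 2D keypoint loss unchanged, while a height/weight term separates the two hypotheses.

```python
import numpy as np

def project(points_3d, focal, center):
    """Pinhole projection of (N, 3) camera-space points to (N, 2) pixels."""
    return focal * points_3d[:, :2] / points_3d[:, 2:3] + center

def keypoint_loss(joints_3d, keypoints_2d, focal, center):
    """Mean squared 2D reprojection error (assumed penalty, not the paper's)."""
    residual = project(joints_3d, focal, center) - keypoints_2d
    return float(np.mean(np.sum(residual ** 2, axis=-1)))

def measurement_loss(height, weight, target_height, target_weight):
    """Squared deviation from a target height (m) and weight (kg)."""
    return (height - target_height) ** 2 + (weight - target_weight) ** 2

# Toy "body": three joints roughly 3 m in front of the camera (hypothetical).
joints = np.array([[0.0, -0.9, 3.0], [0.0, 0.0, 3.1], [0.2, 0.8, 3.0]])
center = np.array([320.0, 240.0])
keypoints = project(joints, 1000.0, center)

# A body that is 20% larger and 20% farther away projects to the same pixels.
scaled = 1.2 * joints
loss_true = keypoint_loss(joints, keypoints, 1000.0, center)
loss_scaled = keypoint_loss(scaled, keypoints, 1000.0, center)

# The measurement term distinguishes the two hypotheses
# (weight is assumed to grow with the cube of the scale factor).
reg_true = measurement_loss(1.70, 65.0, 1.70, 65.0)
reg_scaled = measurement_loss(1.2 * 1.70, 65.0 * 1.2 ** 3, 1.70, 65.0)
```

In a real fitter the height and weight would be computed from the body mesh rather than passed in as numbers, and each parameter group would receive its own learning rate.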
3.2 Shape-conditioned Point Diffusion Priors
Diffusion model and rectified flow.
Diffusion models iteratively perturb and denoise data samples across several diffusion steps [22]. Given a data sample, the forward process gradually blends it with Gaussian noise to obtain increasingly noisy samples. To sample new data, a reverse process is applied to recover clean data from random noise, whereby the noise at each step is estimated by a neural network. However, diffusion models typically require many denoising steps for optimal performance, which is expensive for a pose fitting procedure. Recently, several works [44, 13] incorporated the rectified flow formulation into diffusion models to reduce the number of denoising steps. Here, the forward process is considered a simple linear interpolation between the sample and the noise:
[TABLE]
By a change of variables, the same relation can be expressed in terms of the “flow”. Similar to the standard diffusion model, a reverse process for sampling can also be represented with a neural network:
[TABLE]
where the network predicts the flow at a given time step from the noisy sample. The training loss follows conditional flow matching:
[TABLE]
We follow the reweighting and scheduling of Stable Diffusion 3 [13] to train our model with this loss.
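For orientation, the rectified flow relations described in this paragraph are commonly written as below. This is the standard textbook form rather than a transcription of the paper's equations: the symbols $x_0$ (clean sample), $\epsilon$ (Gaussian noise), $t \in [0, 1]$, $c$ (conditioning), $v_\Theta$ (network), and the step size $\Delta t$ are assumed notation, and the time-dependent loss weighting is omitted.

```latex
\begin{aligned}
x_t &= (1 - t)\,x_0 + t\,\epsilon, \qquad \epsilon \sim \mathcal{N}(0, I) \\
x_t &= x_0 + t\,v, \qquad v = \epsilon - x_0 \\
x_{t - \Delta t} &= x_t - \Delta t \; v_\Theta(x_t, t, c) \\
\mathcal{L}_{\mathrm{CFM}} &= \mathbb{E}_{x_0,\,\epsilon,\,t}
  \bigl\| v_\Theta(x_t, t, c) - (\epsilon - x_0) \bigr\|_2^2
\end{aligned}
```

Under this convention $t = 1$ corresponds to pure noise and $t = 0$ to clean data, so sampling integrates the predicted flow from $t = 1$ down to $t = 0$.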
Point diffusion transformer.
Our objective is to develop a generative model that captures the conditional distribution of body poses [38, 67] given an input image and personalized body shape. We call our novel human body prior PointDiT – it is a 3D point diffusion model designed to sample body points. The detailed architecture of PointDiT is illustrated in Supp. Mat. Sec. 6.2. We adapt the original Diffusion Transformer architecture [56] with the following modifications: (1) We extract image tokens and 2D heatmaps from ViTPose [77] to construct conditional tokens for self-attention conditioning. (2) We replace the class embedding with the shape parameters in adaLN-Zero conditioning. (3) We adopt the rectified flow scheduling and re-weighting from SD3 [13]. During training, noisy body points are generated by perturbing ground-truth points with random noise, as described in Eq. 4. The flow-based formulation allows us to sample point clouds quickly, in only a few denoising steps, which is crucial for integrating PointDiT into the fitting procedure.
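Few-step sampling with a flow model can be sketched as a plain Euler integration. The snippet below is a minimal illustration and not the PointDiT sampler: `euler_sample`, the step count of 4, and the oracle velocity function that stands in for the trained network are all assumptions. It uses the convention that time 1 is pure noise and time 0 is clean data.

```python
import numpy as np

def euler_sample(velocity_fn, noise, num_steps=4):
    """Integrate dx/dt = v(x, t) from t=1 (noise) to t=0 (data) with Euler steps."""
    x = np.array(noise, dtype=float)
    times = np.linspace(1.0, 0.0, num_steps + 1)
    for t_cur, t_next in zip(times[:-1], times[1:]):
        # (t_next - t_cur) is negative: we move from noise towards data.
        x = x + (t_next - t_cur) * velocity_fn(x, t_cur)
    return x

rng = np.random.default_rng(0)
clean_points = rng.standard_normal((8, 3))  # stand-in for body points
noise = rng.standard_normal((8, 3))

# Oracle flow (noise minus data); a trained network would approximate this
# from the noisy points, the time step, and the image/shape conditions.
oracle = lambda x, t: noise - clean_points

sample = euler_sample(oracle, noise, num_steps=4)
```

With the exact straight-line flow, any number of Euler steps recovers the clean points; a learned flow is only approximately straight, which is why the step count is a trade-off between speed and accuracy.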
Model training.
Our body point clouds are made up of a subset of mesh vertices and joints from the SMPL model. We select the number of points based on the accuracy of the SMPL fitter in NLF [61]. The chosen numbers of vertices and joints allow for accurate conversion of point clouds to SMPL pose parameters while still ensuring efficient training (see Supp. Mat. Fig. 10). We train our model on the synthetic BEDLAM [2] dataset as it provides high-quality images with ground-truth shape and pose parameters. During each training iteration, we extract conditional image features from the cropped image and extract points from the ground-truth mesh. These points are linearly blended with Gaussian noise at a randomly sampled time step, following Eq. 4. The PointDiT model is then trained to predict the rectified flow at that time step, given the image features and shape parameters as conditions, using the loss function in Eq. 6.
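One training iteration as described above can be sketched as follows. This is a simplified illustration, not the authors' training code: the function names are hypothetical, the time step is drawn uniformly, and the loss is an unweighted mean squared error, whereas the paper follows the scheduling and reweighting of SD3.

```python
import numpy as np

def make_training_pair(points, rng):
    """Blend ground-truth points with Gaussian noise at a random time step."""
    t = rng.uniform(0.0, 1.0)                 # assumed uniform time sampling
    noise = rng.standard_normal(points.shape)
    noisy_points = (1.0 - t) * points + t * noise
    target_flow = noise - points              # regression target for the network
    return noisy_points, t, target_flow

def flow_matching_loss(predicted_flow, target_flow):
    """Unweighted mean squared error between predicted and target flow."""
    return float(np.mean((predicted_flow - target_flow) ** 2))

rng = np.random.default_rng(0)
gt_points = rng.standard_normal((16, 3))      # stand-in for sampled body points
noisy_points, t, target_flow = make_training_pair(gt_points, rng)

# A perfect prediction gives zero loss; an all-zero prediction does not.
perfect_loss = flow_matching_loss(target_flow, target_flow)
zero_loss = flow_matching_loss(np.zeros_like(target_flow), target_flow)
```

In the actual model the prediction would come from the network given the noisy points, the time step, the image features, and the shape parameters.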
3.3 Prior-guided Body Fitting
Point distillation sampling.
We begin by detailing how PointDiT can serve as a pose prior for guiding the body fitting process. Inspired by the concept of Score Distillation Sampling [57], we introduce Point Distillation Sampling, an iterative process that leverages PointDiT to guide the fitting process. As illustrated in Fig. 3 (b), a noisy point cloud is denoised using PointDiT to produce a clean point cloud, following the procedure outlined in Eq. 5. Using the sampled point cloud, we compute two losses to enforce the 3D prior. The first loss is calculated as the pelvis-aligned L2 error between the sampled point clouds and the corresponding points derived from the fitted body parameters:
[TABLE]
Here, the body is shaped by the calibrated shape parameters obtained in Sec. 3.1, and a differentiable Linear Blend Skinning function computes the selected points from the SMPL parameters. The second loss uses the Point Fitter [61] to convert the point cloud back to pose parameters and penalizes the difference:
[TABLE]
where scalar weights balance each term.
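The first of the two losses, a pelvis-aligned L2 error between point sets, can be sketched in a few lines. The function name and the convention that the pelvis is one of the points (index 0) are assumptions for illustration, and the mean of squared distances is used here without claiming it matches the paper's exact normalization.

```python
import numpy as np

def pelvis_aligned_point_loss(sampled_points, fitted_points, pelvis_index=0):
    """Mean squared distance between two point sets after subtracting
    each set's own pelvis point (removes global translation)."""
    sampled_local = sampled_points - sampled_points[pelvis_index]
    fitted_local = fitted_points - fitted_points[pelvis_index]
    return float(np.mean(np.sum((sampled_local - fitted_local) ** 2, axis=-1)))

rng = np.random.default_rng(0)
points = rng.standard_normal((10, 3))

# A pure translation of the whole body does not change the loss.
shifted = points + np.array([0.5, -0.2, 3.0])
loss_shifted = pelvis_aligned_point_loss(points, shifted)

# Moving a single non-pelvis point by 0.1 contributes 0.1**2 / 10 points.
bent = points.copy()
bent[5] += np.array([0.1, 0.0, 0.0])
loss_bent = pelvis_aligned_point_loss(points, bent)
```

Because the loss ignores global translation, it constrains the local pose only; the pelvis position has to be constrained by other terms, such as the 2D data term.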
Sampling-fitting in the loop.
The objective function of the fitting process, similar to existing methods, comprises a data term and a prior term:
[TABLE]
where the data term minimizes 2D keypoint errors:
[TABLE]
The prior term combines the two Point Distillation Sampling losses introduced above. Unlike traditional fitting methods that use a fixed data prior throughout the optimization process, our approach incorporates a sampling-and-refinement loop to iteratively enhance the sampled point clouds for 3D guidance. That is, at each iteration, the parameters are updated based on Eq. 9 (see Fig. 3 (c)). The updated parameters are then used to generate a new set of body points. Subsequently, these points are perturbed with a small noise level using Eq. 4, and a new point cloud is resampled for the next fitting iteration.
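The structure of this sampling-and-refinement loop can be illustrated with a deliberately small toy problem. Everything here is an assumption made for illustration: the optimized variables are the 3D points themselves rather than SMPL parameters, the camera is orthographic, and `toy_prior` is a hypothetical stand-in for PointDiT that simply places every point at depth 1. The sketch shows how a 2D data term fixes the image-plane coordinates while the resampled prior target resolves the depth that the data term cannot see.

```python
import numpy as np

def sampling_fitting_loop(points_init, keypoints_2d, prior_sample_fn,
                          num_iters=100, step_size=0.1, prior_weight=1.0,
                          noise_level=0.05, seed=0):
    """Alternate between resampling a prior target and a gradient step."""
    rng = np.random.default_rng(seed)
    points = np.array(points_init, dtype=float)
    for _ in range(num_iters):
        # Perturb the current estimate with a small amount of noise ...
        noise = rng.standard_normal(points.shape)
        noisy = (1.0 - noise_level) * points + noise_level * noise
        # ... and let the prior resample a plausible point cloud from it.
        guidance = prior_sample_fn(noisy)
        # Gradient of the prior term: squared distance to the sampled points.
        grad = prior_weight * 2.0 * (points - guidance)
        # Gradient of the data term: squared 2D error (orthographic toy camera).
        grad[:, :2] += 2.0 * (points[:, :2] - keypoints_2d)
        points = points - step_size * grad
    return points

def toy_prior(noisy_points):
    """Hypothetical prior: keeps x/y and places every point at depth z = 1."""
    guided = noisy_points.copy()
    guided[:, 2] = 1.0
    return guided

keypoints = np.array([[0.0, 0.5], [0.2, -0.3]])
init = np.array([[0.3, 0.1, 4.0], [-0.1, 0.0, -2.0]])  # implausible depths
fitted = sampling_fitting_loop(init, keypoints, toy_prior, noise_level=0.0)
```

The call above disables the noise so the result is deterministic; with a nonzero `noise_level` the prior target changes from iteration to iteration, which is the behaviour the loop in the text relies on.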
Pose initialization.
Initialization is crucial in optimization problems. Unlike previous fitting methods that heavily rely on initializations from learned generalized regressors, we leverage PointDiT to sample plausible initial body poses. We also show that our model is compatible out-of-the-box with most existing 3D pose regressors. For a comprehensive evaluation and discussion, please see Sec. 4.2.
4 Experiments
4.1 Experimental Setup
Dataset.
BEDLAM [2] is a synthetic dataset comprising 1M+ training images paired with ground-truth SMPL parameters. BEDLAM is the only dataset we use for training the shape-conditioned PointDiT model. EMDB [31] features in-the-wild videos of 10 subjects, with SMPL poses captured using precise electromagnetic sensors. It also includes accurate 3D global camera trajectories provided by Apple ARKit, along with ground-truth SMPL shapes derived from 3D scans. All experiments and ablation studies are evaluated on the EMDB1 split. In Supp. Mat., we also present results on the 3DPW [73] dataset, which captures daily life human performance with IMUs.
Evaluation protocol.
In the 3D pose estimation literature, the accuracy of pelvis-aligned (local) poses is typically evaluated using the MPJPE(-PA) and MVE(-PA) metrics, i.e., the mean per-joint position error and the mean vertex error, optionally after Procrustes alignment (PA). However, these metrics do not consider pelvis position error, which is crucial for determining absolute pose accuracy in the camera coordinate system. Therefore, we also report the Pelvis Error, which measures the distance between the predicted and ground-truth pelvis positions. Following the approach in SPEC [35], we additionally report the absolute MPJPE metric in the camera coordinate system, referred to as C-MPJPE. To evaluate shape prediction accuracy, we report per-vertex errors of body meshes in the rest pose.
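The differences between these metrics are easiest to see in code. The sketch below gives common definitions and is not the evaluation script used in the paper: function names, the pelvis index, and the similarity (rotation, scale, translation) Procrustes alignment are assumptions. Applied to joints, `mean_point_error` without any alignment corresponds to C-MPJPE, and the same functions applied to mesh vertices give the MVE variants.

```python
import numpy as np

def mean_point_error(pred, gt):
    """Mean Euclidean distance between corresponding (N, 3) points."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def pelvis_aligned_error(pred, gt, pelvis_index=0):
    """Error after translating both point sets so their pelvis is at the origin."""
    return mean_point_error(pred - pred[pelvis_index], gt - gt[pelvis_index])

def procrustes_aligned_error(pred, gt):
    """Error after the best similarity transform of pred onto gt."""
    pred_c = pred - pred.mean(axis=0)
    gt_c = gt - gt.mean(axis=0)
    u, s, vt = np.linalg.svd(pred_c.T @ gt_c)
    # Flip the last singular direction if needed to avoid a reflection.
    flip = np.array([1.0, 1.0, np.sign(np.linalg.det(u @ vt))])
    rotation = (u * flip) @ vt
    scale = float((s * flip).sum() / (pred_c ** 2).sum())
    aligned = scale * pred_c @ rotation + gt.mean(axis=0)
    return mean_point_error(aligned, gt)

def pelvis_error(pred, gt, pelvis_index=0):
    """Distance between predicted and ground-truth pelvis positions."""
    return float(np.linalg.norm(pred[pelvis_index] - gt[pelvis_index]))

rng = np.random.default_rng(0)
gt = rng.standard_normal((24, 3)) + np.array([0.0, 0.0, 3.0])

# Case 1: correct local pose, but the whole body is shifted by 0.1.
shifted = gt + np.array([0.1, 0.0, 0.0])
c_error = mean_point_error(shifted, gt)
local_error = pelvis_aligned_error(shifted, gt)
root_error = pelvis_error(shifted, gt)

# Case 2: wrong scale and global rotation, which Procrustes alignment removes.
angle = 0.3
rot_z = np.array([[np.cos(angle), -np.sin(angle), 0.0],
                  [np.sin(angle), np.cos(angle), 0.0],
                  [0.0, 0.0, 1.0]])
similar = 0.5 * gt @ rot_z.T + np.array([1.0, 2.0, 3.0])
pa_error = procrustes_aligned_error(similar, gt)
local_error_similar = pelvis_aligned_error(similar, gt)
```

Case 1 is invisible to the pelvis-aligned metric but shows up in the absolute metrics, while case 2 is invisible to the PA variant, which is why the text argues for reporting absolute errors as well.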
4.2 3D Pose and Shape Accuracy
Pelvis-aligned pose accuracy.
The problem setting of ScoreHMR [67] closely aligns with our method. Therefore, we conduct a thorough comparison by providing ScoreHMR with the same inputs and ground-truth focal length as our approach. The quantitative results are presented in Tab. 1. We evaluate the methods using three different pose initialization strategies: (1) initializing by sampling a pose using the learned priors (referred to as Sample init.), (2) initializing with HMR2.0b predictions, as used by ScoreHMR (referred to as HMR2.0b init.), and (3) initializing with CameraHMR predictions, the state-of-the-art generalized model trained on the same dataset as our method (referred to as CameraHMR init.).
As shown in Tab. 1, our method outperforms ScoreHMR significantly when initialized from poses generated by the prior. When initializing with HMR2.0b poses, which often produce implausible bending poses (see Fig. 6), our PointDiT effectively corrects these erroneous poses and achieves a significantly larger improvement than ScoreHMR. Furthermore, even when initialized with the SOTA pose regressor, CameraHMR, we observe a notable enhancement in the non-PA metrics, whereas ScoreHMR degrades the results. This shows that incorporating user-specific shape information effectively reduces ambiguity in recovering body scale and orientation (i.e., in absolute 3D poses).
For completeness, we also provide results on 3DPW (Sec. 7.3) and compare our performance to purely learning-based models (Sec. 7.4). Our method either surpasses the state of the art or ranks second-best, even though baseline methods are often trained on multiple datasets (e.g., NLF [61]). On the 3DPW dataset, we observe similar performance gains as on EMDB, albeit less pronounced.
Absolute pose accuracy.
To fairly compare absolute pose accuracy in the camera coordinate system (i.e., without any alignment step), we provide all methods with the ground-truth camera focal length. The results are summarized in Tab. 2. Interestingly, methods that achieve accurate pelvis-aligned poses (low MPJPE in Tab. 1) do not necessarily ensure absolute pose accuracy (high C-MPJPE/Pelvis Err.), e.g., CameraHMR. We believe that this is due to overfitting to pelvis-aligned objectives during training. While deterministic regressors can achieve lower errors in terms of local pose, the estimated shape and pelvis position might not always be plausible in 3D. C-MPJPE is a more suitable metric for evaluating 3D pose accuracy in real-world scenarios, and our personalized fitting method demonstrates improved performance on this metric.
Shape accuracy.
To evaluate the accuracy of SHAPify, we use T-pose input images of 10 subjects from the EMDB dataset and compare our method with shapes extracted from SOTA methods. We report SHAPify’s accuracy on EMDB in Tab. 3. SHAPify consistently outperforms all baselines. Even on a simple T-pose image, existing methods predict shapes with significant errors, particularly when subjects wear loose-fitting clothing, such as jackets. By incorporating user-specific measurements into the fitting process, we effectively address this ill-conditioned problem. Please find more results in Supp. Mat. Sec. 7.1.
4.3 Comparisons of Conditional 3D Body Prior
To verify the robustness of using PointDiT as the 3D body prior, we compare it with the pose prior in ScoreHMR. The results are presented in Fig. 4 and Tab. 4. We first analyze the accuracy of image-conditioned body pose sampling. In Fig. 4 (left), ScoreHMR’s prior fails to sample plausible body poses from input images of challenging poses, whereas our method produces plausible body poses that match the images. ScoreHMR requires relatively accurate pose initializations, as evident from our experiments and Fig. 4 (right). When initializing with HMR2.0, which can produce implausible 3D poses, ScoreHMR struggles to recover from it due to the ambiguity of 2D keypoint reprojection. In contrast, PointDiT provides effective 3D guidance during fitting, helping to correct implausible 3D poses.
4.4 Ablation Studies
Point cloud vs. joint rotation representation.
To evaluate the benefits of using point clouds as the 3D body representation over joint rotations, we trained a variant of the PointDiT model. This variant maintains the same input conditions and network architecture but outputs 6D joint rotations [5] (referred to as 6D Angular). The results presented in Fig. 5 and Tab. 4 indicate that under identical image conditions, the 6D joint rotations result in greater errors in sampled poses, particularly for uncommon poses. We attribute this to the weak correlation between 2D image features and joint rotations. ScoreHMR addresses this issue by utilizing features extracted from a deep layer of a learned 3D pose regressor [34]. While these features are more closely related to joint rotations, they still suffer from the regressor’s limitations of being less effective on uncommon poses. For example, in Fig. 4 (left), ScoreHMR cannot sample a reasonable pose for images of out-of-distribution body poses. In contrast, by using 2D image features with the point cloud representation, our model can handle uncommon poses more effectively (see Fig. 4 Ours Sampling).
Effectiveness of personalization.
We analyze how body shape affects the accuracy of pose fitting. In Tab. 5, we evaluate four different shape configurations: zero shape, per-subject mean shape (from ScoreHMR), fitted shape from SHAPify (w/ and w/o measurements), and the ground-truth shape. Surprisingly, using the mean shape results in poorer fitting performance compared to using the zero shape. Note that averaging the shape is a common practice [17, 65, 67]. Moreover, more accurate body shapes not only improve pelvis-aligned poses but also enhance absolute pose accuracy. We visualize this effect in Supp. Mat. Sec. 7.6.
Effectiveness of point distillation.
Finally, we demonstrate the efficacy of point distillation sampling in body fitting. In Fig. 6, we highlight a common issue of knee bending observed in many existing methods (here HMR2.0b). Such poses may exhibit minimal 2D keypoint projection errors but are highly implausible in 3D. Relying solely on 2D visual cues for pose refinement is insufficient to resolve this issue. As shown in Fig. 6 (right), incorporating point distillation sampling rectifies this problem, ensuring both 3D plausibility and 2D alignment. Consequently, the accuracy of local poses is largely improved (see Tab. 5 (bottom)).
5 Conclusion
PHD represents a step forward from generic models towards personalized 3D human body recovery. By decoupling the traditional regression pipeline into a shape calibration and a body fitting procedure, PHD effectively leverages user-specific identity information to achieve more accurate and robust 3D pose estimates. The core of our method is a powerful shape-conditioned 3D prior, implemented as a point-based diffusion transformer, which guides the body fitting process. Our results showcase that PHD is not only highly versatile but also holds potential to drive future human-centric perceptual AI systems.
Acknowledgements.
This work was partially supported by the Swiss SERI Consolidation Grant "AI-PERCEIVE".
6 Implementation Details
6.1 SHAPify Details
In the optimization of SHAPify, we initialize the pose parameters as the rest pose (T-/I-pose) and the pelvis position to:
[TABLE]
where is the camera focal length and are the pelvis pixels and camera center on the image, respectively. is the depth of the pelvis, which we approximate as . Note that is the shoulder width of the SMPL mean shape, and is the length of shoulder keypoints on the 2D image. This approximation holds because the horizontal line on the image is not affected by the pitch angles of global orientation, and we can assume that the roll and yaw angles of the pelvis orientation are typically small in the frame of rest poses. We also initialize the roll and yaw angles to [math] and update them with small learning rates.
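The initialization described above amounts to a pinhole back-projection of the pelvis pixel at a depth estimated from the shoulder width. The following is a minimal sketch of that idea, not the authors' exact equation: the function name, argument names, and the 0.4 m shoulder width used in the example are illustrative assumptions.

```python
import numpy as np

def init_pelvis_position(pelvis_px, shoulder_l_px, shoulder_r_px,
                         focal, center, shoulder_width_m):
    """Back-project the pelvis pixel using a depth estimated from shoulder width.

    All names are hypothetical; shoulder_width_m stands in for the shoulder
    width of the SMPL mean shape mentioned in the text.
    """
    pelvis_px = np.asarray(pelvis_px, dtype=float)
    center = np.asarray(center, dtype=float)
    # Length of the shoulder segment on the image, in pixels.
    width_px = np.linalg.norm(
        np.asarray(shoulder_l_px, dtype=float) - np.asarray(shoulder_r_px, dtype=float)
    )
    # Similar triangles: depth ~ focal * metric width / pixel width.
    depth = focal * shoulder_width_m / width_px
    # Pinhole back-projection of the pelvis pixel at that depth.
    xy = (pelvis_px - center) * depth / focal
    return np.array([xy[0], xy[1], depth])

# Example with made-up numbers (0.4 m is a placeholder, not the SMPL value).
t0 = init_pelvis_position(
    pelvis_px=(600.0, 700.0),
    shoulder_l_px=(450.0, 300.0),
    shoulder_r_px=(550.0, 300.0),
    focal=1000.0,
    center=(500.0, 500.0),
    shoulder_width_m=0.4,
)
```

With these numbers the shoulder segment spans 100 pixels, so the estimated depth is 4 m and the pelvis lands 0.4 m right of and 0.8 m below the optical axis.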
In the regularization term of SHAPify, we calculate body heights and weights from the SMPL body meshes. The heights () are calculated as the distance from the top of the head (Vertex 411) to the center of the feet (Vertices 3439 and 6839). The weights () are the volume of the human body meshes multiplied by the body density (985 kg/m³). We use , , and if the body measurements are available, and , , and if not. Without using the body measurements, our method achieved a 14 mm joint error and a 13 mm vertex error (Tab. 3), which is only slightly higher than when using the body measurements. Moreover, we found the body weight regularization term crucial. Without such a constraint, the fitting is prone to converge to shapes with large bellies (see Fig. 7).
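The height and weight computations can be sketched as follows. This is an illustrative implementation under stated assumptions: the mesh is closed and consistently oriented, the density value 985 is interpreted as kg/m^3, and the function names are hypothetical. The volume uses the standard signed-tetrahedron sum; the source does not say how the authors compute it. A unit cube stands in for the SMPL mesh so the example is self-contained.

```python
import numpy as np

BODY_DENSITY = 985.0  # value from the text; unit assumed to be kg/m^3

def mesh_volume(vertices, faces):
    """Volume of a closed, consistently oriented triangle mesh."""
    v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))
    # Signed volume of the tetrahedron spanned by each face and the origin.
    signed = np.einsum("ij,ij->i", v0, np.cross(v1, v2)) / 6.0
    return abs(signed.sum())

def body_weight(vertices, faces, density=BODY_DENSITY):
    """Weight as mesh volume times body density."""
    return mesh_volume(vertices, faces) * density

def body_height(vertices, head_idx=411, feet_idx=(3439, 6839)):
    """Distance from the head-top vertex to the midpoint of the two feet vertices."""
    feet_center = vertices[list(feet_idx)].mean(axis=0)
    return float(np.linalg.norm(vertices[head_idx] - feet_center))

# Toy mesh: a unit cube with outward-facing triangles (not an SMPL mesh).
cube_vertices = np.array([
    [0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 1, 0],
    [0, 0, 1], [1, 0, 1], [1, 1, 1], [0, 1, 1],
], dtype=float)
cube_faces = np.array([
    [0, 2, 1], [0, 3, 2],  # bottom
    [4, 5, 6], [4, 6, 7],  # top
    [0, 1, 5], [0, 5, 4],  # front
    [3, 7, 6], [3, 6, 2],  # back
    [0, 4, 7], [0, 7, 3],  # left
    [1, 2, 6], [1, 6, 5],  # right
])

cube_volume = mesh_volume(cube_vertices, cube_faces)
cube_weight = body_weight(cube_vertices, cube_faces)
# Use cube vertex indices in place of the SMPL vertex indices.
cube_height = body_height(cube_vertices, head_idx=6, feet_idx=(0, 1))
```

On an actual SMPL mesh, the default vertex indices from the text (411, 3439, 6839) would be used instead of the cube indices.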
6.2 PointDiT Architecture and Training
We show the detailed network architecture of PointDiT in Fig. 11. More specifically, the PointDiT model contains 20 DiT blocks and operates at a hidden dimension of 512. We use the frozen ViTBackbone [77] to extract image features of the shape of and heatmaps of the shape of . We add the image features, heatmaps, and positional embeddings together to obtain the final conditional features c. Our body point clouds are made up of 45 SMPL joints and 238 vertices from the SMPL surface (see Fig. 10). This configuration was chosen based on the accuracy of the Point Fitter in NLF [61]. The point clouds used for diffusion are of the shape , and we normalize the points to zero mean and unit variance before adding noise. We use the rectified flow formulation [44, 13] in the diffusion model to reduce the number of denoising steps during inference, as described in the main paper. Both conditional features and point clouds are projected to 512-dimensional tokens and fed into the transformer. In the output layer, we project the tokens back to 3-dimensional points and de-normalize them. The image conditional tokens are only used for self-attention conditioning and are discarded in the final layer.
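To make the normalization and rectified-flow sampling concrete, here is a minimal sketch with a toy velocity function standing in for the trained network. The assumptions are labeled in the comments: the time convention (t = 0 is noise, t = 1 is data), plain Euler integration, and all function names are illustrative choices rather than details confirmed by the text. The point count 45 + 238 follows the description above.

```python
import numpy as np

NUM_POINTS = 45 + 238  # SMPL joints + surface vertices, per the text

def normalize(points, mean, std):
    """Map points to zero mean and unit variance given precomputed statistics."""
    return (points - mean) / std

def denormalize(points, mean, std):
    """Invert normalize()."""
    return points * std + mean

def sample_points(velocity_fn, cond, num_steps=5, rng=None):
    """Euler integration of a rectified flow from Gaussian noise to data.

    Assumed convention: x_t = (1 - t) * noise + t * data, so the model
    predicts the velocity (data - noise) along a straight path.
    """
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal((NUM_POINTS, 3))
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t, cond)
    return x

# Toy stand-in for the network: the exact velocity toward a fixed target.
target = np.random.default_rng(0).standard_normal((NUM_POINTS, 3))

def toy_velocity(x, t, cond):
    return (target - x) / (1.0 - t)

sample = sample_points(toy_velocity, cond=None, rng=np.random.default_rng(1))
```

Because the toy velocity field is exactly straight, five Euler steps land on the target; a learned field is only approximately straight, which is why the number of steps remains a tunable parameter.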
We train our PointDiT model using the synthetic BEDLAM dataset. We apply standard data augmentations [2, 12] to the provided image crops and ground-truth annotations. Our training consists of two stages. In the first stage, we set all conditional shape parameters to zero, allowing the model to focus on sampling body point clouds corresponding to the conditional images. In the second stage, we learn the correct body shape of point clouds using the ground-truth shape parameters as conditions. For training, we utilize a batch size of 512 images and set the learning rate to with the AdamW optimizer. The training takes approximately 1 day for the first stage, with 12K iterations, and another 2 days for the second stage, with 30K iterations, on 8 NVIDIA V100 GPUs. The training scheduler and reweighting factors follow the same configuration as in Stable Diffusion 3 [13]. We employ a dropout rate of 0.05 for the image and shape conditioning.
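The conditioning dropout mentioned above can be sketched as below. This is a hedged illustration: the text gives only the rate (0.05), so dropping the image and shape conditions independently per sample and replacing a dropped condition with zeros are assumptions, as are the function and argument names.

```python
import numpy as np

def drop_conditions(image_cond, shape_cond, p=0.05, rng=None):
    """Randomly zero out image and shape conditions per sample.

    Assumptions: independent dropout for the two conditions, and zeros
    as the 'null' condition. image_cond has shape (batch, ...) and
    shape_cond has shape (batch, num_shape_params).
    """
    rng = np.random.default_rng() if rng is None else rng
    batch = image_cond.shape[0]
    keep_image = rng.random(batch) >= p
    keep_shape = rng.random(batch) >= p
    # Broadcast the per-sample mask over the remaining dimensions.
    image_mask = keep_image.reshape(batch, *([1] * (image_cond.ndim - 1)))
    return image_cond * image_mask, shape_cond * keep_shape[:, None]

# Example with made-up tensor sizes.
image_cond = np.ones((4, 16, 8))
shape_cond = np.ones((4, 10))
kept_image, kept_shape = drop_conditions(image_cond, shape_cond, p=0.0)
dropped_image, dropped_shape = drop_conditions(image_cond, shape_cond, p=1.0)
```

Setting the shape condition to zero for every sample corresponds to the first training stage described above, where the model ignores shape and focuses on the image.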
6.3 Point Distillation and Body Fitting Details
In the body fitting stage, we can initialize pose and global orientation with either our sampled points or the results from a regressor. To initialize the pelvis position , we solve the weighted least squares problem by plugging in to Eq. 10. Afterward, we optimize 100 iterations per image with a learning rate of for optimizing and for optimizing with the AdamW optimizer. The is set to 1.0 and the is set to 100, (, , ) are set to (0.1, 0.1, 1.0). Every 10 iterations, we resample the points again from the fitted meshes.
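One standard way to pose the pelvis initialization as a weighted linear least squares problem is shown below. Since Eq. 10 is not reproduced in this section, this is a sketch of a common formulation rather than the authors' exact one: it assumes a pinhole camera with a single focal length, root-relative 3D points, and per-keypoint confidence weights, and all names are hypothetical.

```python
import numpy as np

def solve_translation(points_3d, keypoints_2d, weights, focal, center):
    """Find t such that projecting (points_3d + t) best matches keypoints_2d.

    From u = focal * (X + tx) / (Z + tz) + cx we get the linear constraint
    focal * tx - (u - cx) * tz = (u - cx) * Z - focal * X, and likewise for v.
    """
    points_3d = np.asarray(points_3d, dtype=float)
    keypoints_2d = np.asarray(keypoints_2d, dtype=float)
    X, Y, Z = points_3d.T
    u = keypoints_2d[:, 0] - center[0]
    v = keypoints_2d[:, 1] - center[1]
    zeros = np.zeros_like(X)
    f = np.full_like(X, focal)
    A = np.concatenate([
        np.stack([f, zeros, -u], axis=1),
        np.stack([zeros, f, -v], axis=1),
    ])
    b = np.concatenate([u * Z - focal * X, v * Z - focal * Y])
    # Weighted least squares: scale each row by the square root of its weight.
    sw = np.sqrt(np.concatenate([weights, weights]))
    t, *_ = np.linalg.lstsq(A * sw[:, None], b * sw, rcond=None)
    return t

# Synthetic check: project points with a known translation and recover it.
rng = np.random.default_rng(0)
pts = rng.uniform(-0.5, 0.5, size=(20, 3))
t_true = np.array([0.1, -0.2, 3.0])
focal, center = 1000.0, np.array([512.0, 512.0])
cam = pts + t_true
kps = focal * cam[:, :2] / cam[:, 2:3] + center
t_est = solve_translation(pts, kps, np.ones(20), focal, center)
```

With noise-free keypoints the solver recovers the translation exactly; with real detections, the weights let low-confidence keypoints contribute less.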
6.4 Inference
We extract 2D keypoints using Sapiens [32], and we pre-process keypoints and bounding boxes (image crops) before inference. SHAPify is a lightweight optimization algorithm that can be run on a CPU and takes roughly 1 second for each image. For body fitting inference, we tested our model on a single NVIDIA RTX 3090 GPU. We set the number of denoising steps to , and it takes approximately 1 second per frame without any parallelization (batch size ). As a reference, ScoreHMR [67] requires denoising steps and takes about 3 seconds per frame under the same data setup.
7 More Experimental Results
7.1 Shape Estimation
We visualize the results of SHAPify against existing methods on EMDB in Fig. 9. Even on a simple T-pose image, existing methods predict shapes with significant errors, particularly when subjects wear loose-fitting clothing, such as jackets.
EMDB mainly contains subjects with moderate body shapes. Since real-world data of extreme body shapes is limited, in Fig. 8 we visualize the fitting results of SHAPify and PHD on a newly recorded slim subject and on a high-BMI subject from the HBW [7] dataset. Because PointDiT is trained on BEDLAM, which contains body shapes across a wide BMI range (17.5 to 42.5), our method is able to generalize to human subjects with diverse body shapes.
7.2 Effectiveness of Point Fitter
In Tab. 6, we analyze the effect of the point cloud size by retaining 25% and 50% of the points for fitting. Furthermore, Inverse Kinematics (IK) [81, 48] is commonly used for fitting 3D joints back to the parameter space. We also design a joint-only baseline by replacing the Point Fitter with an open-source IK solver [81]. The joint-only IK is less accurate than Point Fitter and incurs expensive optimization loops that slow down the method (15 seconds per frame). In contrast, the Point Fitter is faster but requires more points to ensure accuracy. When only 25% of points are retained, we observe a clear increase in MVE. Overall, our design is both efficient and accurate.
7.3 Results on 3DPW
We conduct an evaluation similar to that in Sec. 4.2 on the 3DPW dataset [73]. Note that for this benchmark, we do not have access to the body measurements of the subjects, so we use the estimated shapes for evaluation. Overall, we observed performance improvements similar to those on EMDB, but less pronounced. We identified two major issues with this benchmark.
First, as noted in EMDB [31], performance on 3DPW has saturated due to its limited pose diversity, which motivates the need for EMDB as a diverse and challenging benchmark. Many recent methods report trends consistent with ours, i.e., smaller improvements on 3DPW than on EMDB. Second, unlike EMDB, which uses EM sensors and SLAM to reconstruct accurate 3D camera trajectories and body poses, 3DPW estimates them through a joint optimization process based on 2D keypoints and IMU sensors. This approach introduces systematic errors of approximately 26 mm, as reported in the original paper, resulting in a biased pose distribution. Since this distribution deviates from that of synthetic training data, the learned prior becomes less effective during the fitting stage. Consequently, recent methods [61, 65] often finetune on the 3DPW training set to better align with its data distribution. Overall, we believe that consistent performance gains on EMDB without finetuning are more reliable indicators of robustness to diverse and challenging poses.
7.4 Comparison to Learning-Based Methods
We extensively compare our method against learning-based approaches in Tab. 8. Notably, our model is trained exclusively on the synthetic BEDLAM dataset (1M images), whereas other methods are trained on a combination of real-world datasets such as Human3.6M [23], MPI-INF-3DHP [49], and in the case of NLF [61], over 40 datasets.
On the EMDB benchmark, when initializing our optimization with poses randomly sampled from PointDiT, we observe slightly higher errors compared to CameraHMR. This is likely due to domain gaps between the synthetic training images and real test-time images. However, when using CameraHMR for pose initialization, our method either outperforms the state-of-the-art (NLF [61]) or comes in a close second. On the 3DPW dataset, initializing with PointDiT-sampled poses yields suboptimal performance, primarily due to the biased pose distribution discussed in Sec. 7.3 and the domain mismatch between the training and test images. To address this, we utilize CameraHMR for pose initialization, achieving the second-best performance on 3DPW. It is worth noting that the (PA-)MPJPE values on 3DPW have saturated and are close to the reported systematic error (26mm). As such, it is difficult to determine whether improvements reflect actual accuracy gains or overfitting to inherent dataset biases. Consequently, we argue that the evaluations on EMDB provide a more meaningful reflection of true pelvis-aligned 3D pose accuracy.
7.5 PointDiT Sampling
In Fig. 12, we visualize the point cloud denoising process of our PointDiT model. Our model leverages the rectified flow formulation to train a diffusion model, enabling sampling of body point clouds in as few as 5 denoising steps. In Fig. 13, we demonstrate the effectiveness of shape conditioning in our PointDiT model. Given the same input images but with different conditional shape parameters, our model generates diverse body shapes that correspond to the body poses described in the input images.
7.6 Effect of Shape on Pelvis Position
In Fig. 15, we illustrate how an incorrect body shape adversely affects pelvis positioning. Using incorrect shape parameters, such as the mean shape, means that the wrong bone lengths are used for body fitting. This affects not only the accuracy of the local pose but also the pelvis position in camera coordinates. Please also refer to our video for a better visualization of this effect.
7.7 More qualitative results
We present more qualitative comparisons with ScoreHMR [67] in Fig. 17 and WHAM [65] in Fig. 18. In Fig. 14, we also showcase results of in-the-wild human performance capture using our method with a modern smartphone.
8 Discussion
8.1 Applications
Our method is beneficial for in-the-wild avatar reconstruction [76, 17, 18, 51]. Existing avatar reconstruction methods all require accurate pose fitting as a starting point, with the most common practice being to use the averaged shape from a regressor and refine the poses using 2D keypoints or by jointly optimizing poses with appearance cues [46]. However, these strategies are only effective when pose errors are small and correctable. As discussed earlier, generalized regressors can easily produce implausible 3D poses, which prevent the avatar from learning accurate pose-dependent appearances. Therefore, PHD offers a simple and effective solution by providing more accurate pose and shape inputs for in-the-wild avatar reconstruction. Furthermore, we believe that accurate personal shape information is a key factor for future avatar-centric applications, such as 3D virtual try-on and digital human interaction.
8.2 Absolute Pose Metrics
In Sec. 4.2, we emphasized the importance of absolute pose accuracy in the camera coordinate system and showed that most state-of-the-art pose regressors failed to achieve desirable results. Recently, a new line of work [78, 79, 65, 25, 64] has pursued a similar idea in a slightly different setup by estimating body poses and trajectories in the world coordinate system. Their evaluation metric, G-MPJPE, is determined by two factors: (1) how well the human poses and motion over time are estimated, and (2) how accurately the camera pose trajectories are recovered. Consequently, these approaches usually involve full-video SLAM tracking and global optimization over body and camera poses. This metric is more suitable for video-based methods.
In contrast, our proposed camera-coordinate metric, C-MPJPE, is better suited for per-frame or online methods that can be used for on-device computations (like ours). C-MPJPE disentangles the error caused by camera poses and evaluates only the accuracy of the body pose. Because personal shape (scale) information is fixed in our problem setting, C-MPJPE is also correlated with 2D alignment accuracy, as the 2D projection is definitive when the body scale is fixed. It is also worth noting that C-MPJPE can be converted to G-MPJPE by accounting for the camera extrinsics over time.
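The difference between the camera-coordinate metric and the pelvis-aligned metric can be illustrated with a short sketch. The function names and the use of joint index 0 as the pelvis (the SMPL convention) are assumptions for illustration, and the definitions below follow the description in the text rather than the authors' evaluation code.

```python
import numpy as np

def c_mpjpe(pred, gt):
    """Mean per-joint error in camera coordinates, with no alignment."""
    return float(np.linalg.norm(pred - gt, axis=-1).mean())

def pelvis_aligned_mpjpe(pred, gt, pelvis_idx=0):
    """Mean per-joint error after subtracting each skeleton's pelvis."""
    return c_mpjpe(pred - pred[pelvis_idx], gt - gt[pelvis_idx])

def camera_to_world(joints_cam, R_wc, t_wc):
    """Apply camera-to-world extrinsics (rotation R_wc, translation t_wc)."""
    return joints_cam @ R_wc.T + t_wc

# Toy example: a prediction that is perfect up to a 10 cm global offset.
rng = np.random.default_rng(0)
gt = rng.standard_normal((24, 3))
pred = gt + np.array([0.1, 0.0, 0.0])

err_camera = c_mpjpe(pred, gt)
err_aligned = pelvis_aligned_mpjpe(pred, gt)

# With exact extrinsics, a rigid transform preserves the per-joint error.
theta = np.pi / 6
R = np.array([
    [np.cos(theta), -np.sin(theta), 0.0],
    [np.sin(theta), np.cos(theta), 0.0],
    [0.0, 0.0, 1.0],
])
t = np.array([1.0, 2.0, 3.0])
err_world = c_mpjpe(camera_to_world(pred, R, t), camera_to_world(gt, R, t))
```

The pelvis-aligned error is zero here even though the body is misplaced by 10 cm, which is exactly the kind of error that only the camera-coordinate metric exposes. With exact extrinsics the world-space error equals the camera-space error, so any gap between the two in practice comes from camera pose estimation.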
8.3 Limitations and Future Work
Hands and faces.
Currently, our work focuses on fitting SMPL parameters to videos due to the limited availability of evaluation benchmarks for in-the-wild human pose estimation. It is worth noting that both our method and the PointDiT model can be easily extended to any format of parametric models, such as SMPL-X. One of our future goals is to extend PointDiT to handle hand gestures and facial expressions. This can be readily achieved by selecting the corresponding data from the BEDLAM training set in SMPL-X format. We believe that for future perceptual AI systems, estimating holistic human body poses with precise hand and facial information will be indispensable.
2D keypoint detection.
Although PointDiT is an effective 3D prior, our fitting method still depends on the accuracy of 2D keypoint detections. When these detections fail significantly (see Fig. 16), our method struggles to recover. An exciting direction for future work is to explore appearance- or feature-based fitting by leveraging powerful, pre-trained foundation models such as DINOv2 [53]. This approach could make the fitting process more robust to challenging poses on which 2D keypoint detectors may fail.
Temporal Smoothness.
Our method and the evaluations in this paper follow a per-frame setting. This setup is closer to real-world on-device computation, allowing for a fair assessment of the model’s performance without relying on future or past information. While the metrics appear promising, we observed noticeable jittering across frames. Additionally, our method does not hallucinate continuous body motion when parts of the body are occluded or unobserved. To address these issues, a promising direction is to extend PointDiT into a temporal model that incorporates motion priors rather than predicting a single pose at a time. However, training such a model would require even more data beyond what BEDLAM provides. We believe that video generative AI can help address this challenge by offering a wide range of realistic motion data.
Learning-based optimization.
Our method can run at 1 FPS on a common GPU (RTX 3090). While this is already faster than most existing pose fitting methods, it is not yet suitable for real-time pose tracking. This limitation stems from the nature of optimization loops. Even though PointDiT provides strong 3D guidance during the fitting process, it still requires several iterations to converge. One way to mitigate these long optimization loops is to employ a "learning to optimize" approach [66, 8], which enables the network to memorize optimization trajectories. Such an approach could potentially achieve faster inference through a feed-forward network while retaining the accuracy of optimization.
- [1] Ijaz Akhter and Michael J. Black. Pose-conditioned joint angle limits for 3D human pose reconstruction. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1446–1455, 2015.
- [2] Michael J. Black, Priyanka Patel, Joachim Tesch, and Jinlong Yang. BEDLAM: A synthetic dataset of bodies exhibiting detailed lifelike animated motion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 8726–8737, 2023.
- [3] Federica Bogo, Angjoo Kanazawa, Christoph Lassner, Peter Gehler, Javier Romero, and Michael J. Black. Keep it SMPL: Automatic estimation of 3D human pose and shape from a single image. In Proceedings of the European Conference on Computer Vision (ECCV), 2016.
- [4] Zhongang Cai, Wanqi Yin, Ailing Zeng, Chen Wei, Qingping Sun, Wang Yanjun, Hui En Pang, Haiyi Mei, Mingyuan Zhang, Lei Zhang, Chen Change Loy, Lei Yang, and Ziwei Liu. SMPLer-X: Scaling up expressive human pose and shape estimation. In Advances in Neural Information Processing Systems (NeurIPS), 2023.
- [5] Hanbyel Cho and Junmo Kim. Generative approach for probabilistic human mesh recovery using diffusion models. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV) Workshops, pages 4183–4188, 2023.
- [6] Junhyeong Cho, Kim Youwang, and Tae-Hyun Oh. Cross-attention of disentangled modalities for 3D human mesh recovery with transformers. In Proceedings of the European Conference on Computer Vision (ECCV), pages 342–359. Springer, 2022.
- [7] Vasileios Choutas, Lea Müller, Chun-Hao P. Huang, Siyu Tang, Dimitrios Tzionas, and Michael J. Black. Accurate 3D body shape regression using metric and semantic attributes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2718–2728, 2022.
- [8] Enric Corona, Gerard Pons-Moll, Guillem Alenyà, and Francesc Moreno-Noguer. Learned vertex descent: A new direction for 3D human model fitting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
