PEAR: Pixel-aligned Expressive humAn mesh Recovery
Jiahao Wu, Yunfei Liu, Lijian Lin, Ye Zhu, Lei Zhu, Jingyi Li, Yu Li

TL;DR
PEAR is a fast, robust framework that reconstructs detailed 3D human meshes from single images in real time, improving fine-grained pose and facial expression accuracy through pixel-level supervision and a ViT-based model.
Contribution
The paper introduces PEAR, a unified ViT-based model that achieves real-time 3D human mesh recovery with enhanced detail and robustness, addressing limitations of prior SMPLX-based methods.
Findings
Achieves over 100 FPS inference speed.
Significantly improves pose estimation accuracy.
Effectively captures facial expressions and fine details.
Abstract
Reconstructing detailed 3D human meshes from a single in-the-wild image remains a fundamental challenge in computer vision. Existing SMPLX-based methods often suffer from slow inference, produce only coarse body poses, and exhibit misalignments or unnatural artifacts in fine-grained regions such as the face and hands. These issues make current approaches difficult to apply to downstream tasks. To address these challenges, we propose PEAR, a fast and robust framework for pixel-aligned expressive human mesh recovery. PEAR explicitly tackles three major limitations of existing methods: slow inference, inaccurate localization of fine-grained human pose details, and insufficient facial expression capture. Specifically, to enable real-time SMPLX parameter inference, we depart from prior designs that rely on high-resolution inputs or multi-branch architectures. Instead, we adopt a clean and…
Peer Reviews
Decision: ICLR 2026 Conference Withdrawn Submission
1. By jointly regressing SMPLX (body) and FLAME (head) parameters under the Expressive Human Model (EHM), the presented method unifies coarse pose estimation with fine-grained facial expressiveness, which is more practical than the SMPLX-only methods. 2. Real-time inference (0.05 seconds per frame) from a single 256×192 image, without cropping or high-resolution input, is practically valuable for downstream animation tasks and interactive applications. 3. The construction of a large-scale datase
1. From the paper, especially the contributions stated in the introduction, it is unclear how the method achieves its promising results. Much of the framework builds upon GUAVA and HMR2, with the main innovation being the introduction of pixel-level supervision. Beyond integrating known components, what fundamentally new representational or algorithmic insight does PEAR introduce? Is the gain mainly from adding the photometric loss? 2. The paper focuses on alignment but does not address the limit
- The method is straightforward and easy to understand; the manuscript is clearly written. - The proposed method could perform full-body 3D modeling without the need for cropping. - A large-scale human mesh dataset is annotated and slated for open release.
- Technical novelty. The proposed pipeline closely follows GUAVA: (a) it adopts the enhanced human parametric model EHM (introduced by GUAVA); (b) in Stage-1, EHM parameters are trained using pseudo ground truth generated by GUAVA; (c) in Stage-2, the neural renderer reuses GUAVA’s pipeline. Hence, compared with GUAVA, the difference is that instead of tracking-based EHM parameter estimation, this paper swaps in an HMR2-based parameter estimator and then jointly optimizes EHM estimation and the
- Paper is well-written and easy to understand. - By combining SMPLX and FLAME within the Expressive Human Model and using photometric supervision, the framework effectively captures subtle facial expressions and hand detail, outperforming recent works on human mesh recovery.
- The core design of this work, which combines an EHM (SMPLX + FLAME) with a neural renderer for pixel-level photometric supervision, closely follows the formulation of GUAVA (Zhang et al., ICCV 2025). While PEAR extends GUAVA’s upper-body focus to full-body reconstruction and adopts a two-stage training pipeline instead of optimization-based parameter tracking, the overall architecture and objective remain conceptually similar. The paper should clarify the key algorithmic differences or technica
1. Pixel-level photometric supervision: The second-stage neural rendering significantly improves fine-grained alignment beyond joint/parameter losses.
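To make the idea of pixel-level photometric supervision concrete, here is a minimal sketch of a masked L1 photometric loss between a rendered image and the input image. It is an illustration only: the exact loss used by PEAR is not given in this summary, and the function name `masked_photometric_l1`, the L1 form, and the foreground mask are all assumptions. The 256×192 image size matches the input resolution mentioned in the reviews.

```python
import numpy as np


def masked_photometric_l1(rendered, target, mask):
    """Mean absolute RGB difference over the pixels selected by `mask`.

    Hypothetical helper, not taken from the paper.
    rendered, target: float arrays of shape (H, W, 3), values in [0, 1]
    mask: array of shape (H, W); nonzero entries mark supervised pixels
    """
    # Broadcast the binary mask over the colour channels: (H, W) -> (H, W, 1).
    weights = (np.asarray(mask) > 0).astype(np.float64)[..., None]
    abs_diff = np.abs(
        np.asarray(rendered, dtype=np.float64)
        - np.asarray(target, dtype=np.float64)
    )
    # Number of supervised scalar values (selected pixels times channels).
    num_values = weights.sum() * abs_diff.shape[-1]
    if num_values == 0:
        return 0.0  # nothing to supervise
    return float((abs_diff * weights).sum() / num_values)


# Toy example: the render is off by 0.5 in the top half of the image only.
height, width = 256, 192
target = np.zeros((height, width, 3))
rendered = np.zeros((height, width, 3))
rendered[: height // 2] = 0.5

full_mask = np.ones((height, width))
top_mask = np.zeros((height, width))
top_mask[: height // 2] = 1

print(masked_photometric_l1(rendered, target, full_mask))  # 0.25
print(masked_photometric_l1(rendered, target, top_mask))   # 0.5
```

Unlike joint or parameter losses, a term of this kind penalises every misaligned pixel inside the mask, which is one plausible reason it would help fine-grained alignment in regions such as the face and hands. In a real training setup the rendered image would come from a differentiable renderer so that gradients reach the mesh parameters; that part is omitted here.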
1. FLAME head pose integration: The paper states the proposed system estimates both SMPL-X and FLAME parameters, but it is unclear how global head orientation is consistently maintained. For example, when the body is rotated or facing away, naïvely replacing the head could cause inconsistencies between the body and face orientation. Clarification is needed on how alignment between body root pose and FLAME global pose is enforced. 2. Stage-2 reliance on upper-body datasets: The second-stage trai
Taxonomy
Topics: Human Pose and Action Recognition · Face recognition and analysis · 3D Shape Modeling and Analysis
