Direct Reward Fine-Tuning on Poses for Single Image to 3D Human in the Wild
Seunguk Do, Minwoo Huh, Joonghyuk Shin, Jaesik Park

TL;DR
This paper introduces DrPose, a post-training fine-tuning method that improves pose accuracy in single-image 3D human reconstruction. It maximizes a differentiable pose reward (PoseScore) on a new pose-focused dataset and does not require expensive 3D human assets.
Contribution
The paper presents DrPose, a post-training fine-tuning approach that uses a new pose dataset and a differentiable reward to improve 3D human reconstruction in challenging poses.
Findings
Significant improvement in pose accuracy on benchmarks.
Enhanced reconstruction quality for dynamic and challenging poses.
Effective use of pose data without requiring 3D assets.
Abstract
Single-view 3D human reconstruction has achieved remarkable progress through the adoption of multi-view diffusion models, yet the recovered 3D humans often exhibit unnatural poses. This phenomenon becomes pronounced when reconstructing 3D humans with dynamic or challenging poses, which we attribute to the limited scale of available 3D human datasets with diverse poses. To address this limitation, we introduce DrPose, a Direct Reward fine-tuning algorithm on Poses, which enables post-training of a multi-view diffusion model on diverse poses without requiring expensive 3D human assets. DrPose trains a model using only human poses paired with single-view images, employing direct reward fine-tuning to maximize PoseScore, our proposed differentiable reward that quantifies consistency between a generated multi-view latent image and a ground-truth human pose. This optimization is…
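The abstract's core idea, gradient ascent on a differentiable reward backpropagated through the generator, can be illustrated with a toy sketch. Everything here is an assumption made for illustration: the linear `generate` function stands in for the multi-view diffusion model, `toy_reward` (negative mean squared keypoint error) stands in for PoseScore, and the pose values are made up. It is not the authors' implementation.

```python
# Toy sketch of direct reward fine-tuning. NOT the paper's model or PoseScore.
# Assumption: the "generator" is a linear map from a scalar input to flattened
# 2D keypoints, and the reward is the negative mean squared keypoint error.

def generate(theta, x):
    """Hypothetical generator: output coordinate i is theta[i] * x."""
    return [t * x for t in theta]

def toy_reward(pred, gt_pose):
    """Hypothetical differentiable reward: higher when pred matches gt_pose."""
    n = len(gt_pose)
    return -sum((p - g) ** 2 for p, g in zip(pred, gt_pose)) / n

def reward_grad(theta, x, gt_pose):
    """Analytic gradient of toy_reward(generate(theta, x), gt_pose) w.r.t. theta."""
    n = len(gt_pose)
    pred = generate(theta, x)
    return [-2.0 * (p - g) * x / n for p, g in zip(pred, gt_pose)]

def fine_tune(theta, x, gt_pose, lr=0.5, steps=200):
    """Gradient ascent on the reward, differentiated through the generator."""
    for _ in range(steps):
        grad = reward_grad(theta, x, gt_pose)
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta

gt_pose = [0.2, 0.8, -0.5, 0.1]  # made-up flattened (x, y) keypoints
theta0 = [0.0, 0.0, 0.0, 0.0]    # made-up "pretrained" parameters
x = 1.0                          # stand-in for the input image

theta1 = fine_tune(theta0, x, gt_pose)
reward_before = toy_reward(generate(theta0, x), gt_pose)
reward_after = toy_reward(generate(theta1, x), gt_pose)
```

In this sketch only a pose target and an input are needed to compute the training signal, which mirrors the abstract's point that no 3D asset is required. In a real setting the analytic gradient would be replaced by automatic differentiation through the diffusion sampler and the reward model.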
Peer Reviews
Decision: ICLR 2026 Poster
- Very well motivated: recent diffusion-based methods fail on complicated poses (e.g., acrobatic poses).
- Well-designed and effective RL-based reward model that is differentiable.
- Strong improvement over SOTA methods (SiTH, PSHuman, etc.).
- Thoroughly studied recent literature.
- The DRPose15k dataset provides extreme-pose ground-truth SMPL-image pairs, which are valuable to the community. The idea of using a pose-conditioned video diffusion model to create this dataset is brilliant, as
- Colors seem washed out; I have noticed this in recent diffusion-based single-view human reconstruction methods. Why is this happening, and how can it be improved?
- Reproducibility concern: As suggested in the “Suggestions” section, it would be great if architectural details could be included in the appendix of the final revision.
- Otherwise, I am happy with the current submission.
1. Well-motivated focus on pose quality limitations in existing approaches, with quantitative evidence (1.73× larger pose diversity in DRPOSE15K vs THuman2.1).
2. Creative data construction: leveraging motion capture + video generation to avoid expensive 3D scanning.
- Circular dependency in data quality: DRPOSE15K relies on MIMO to generate training images, creating a potential bottleneck. If MIMO produces unrealistic appearances for extreme poses, the model learns from flawed data. This isn't adequately discussed or validated.
- Insufficient reward model analysis: no ablation on reward formulation alternatives (e.g., direct 3D keypoint prediction, discriminator-based rewards).
- Insufficient failure cases are shown, it is hard to evaluate, but the geometry sho
- Comprehensive Motivation and Contribution: The paper starts with a clear motivation, highlighting the limitations of existing datasets and evaluation protocols for single-image 3D human reconstruction. To address these, the authors propose a systematic set of contributions: a novel post-training method (POSESCORE-driven fine-tuning), a new training dataset (DRPOSE15K), and a tailored evaluation benchmark (MIXAMORP). The contributions are well-aligned with the stated motivation and provide a h
- Pipeline Novelty: While the paper provides a systematic solution, the overall pipeline lacks substantial innovation. The framework is primarily built on an image-to-multi-view (I2MV) diffusion model followed by existing 3D reconstruction techniques. The pipeline largely follows previous designs (like PSHuman), with the main novelty lying in the reward-based fine-tuning (POSESCORE) and the dataset contributions.
- Limited Impact of Individual Contributions: The reward function POSESCORE, wh
Taxonomy
Topics: Human Pose and Action Recognition · Generative Adversarial Networks and Image Synthesis · Human Motion and Animation
