PoseCraft: Tokenized 3D Body Landmark and Camera Conditioning for Photorealistic Human Image Synthesis

Zhilin Guo; Jing Yang; Kyle Fogarty; Jingyi Wan; Boqiao Zhang; Tianhao Wu; Weihao Xia; Chenliang Zhou; Sakar Khattar; Fangcheng Zhong; Cristina Nader Vasconcelos; Cengiz Oztireli

arXiv:2602.19350·cs.CV·February 24, 2026

Open Access

TL;DR

PoseCraft introduces a diffusion-based framework that encodes 3D landmarks and camera parameters as tokens, enabling photorealistic human image synthesis with explicit pose and camera control, surpassing prior methods in quality and detail.

Contribution

The paper proposes a tokenized 3D conditioning approach for diffusion models that avoids the ambiguity of 2D re-projection under large pose and viewpoint changes and improves photorealism in human image synthesis.

Findings

01. Achieves higher perceptual quality than existing diffusion methods.

02. Preserves fine fabric and hair detail better than state-of-the-art volumetric methods.

03. Handles large pose and viewpoint variations effectively.

Abstract

Digitizing humans and synthesizing photorealistic avatars with explicit 3D pose and camera controls are central to VR, telepresence, and entertainment. Existing skinning-based workflows require laborious manual rigging or template-based fitting, while neural volumetric methods rely on canonical templates and re-optimization for each unseen pose. We present PoseCraft, a diffusion framework built around a tokenized 3D interface: instead of relying only on rasterized geometry as 2D control images, we encode sparse 3D landmarks and camera extrinsics as discrete conditioning tokens and inject them into the diffusion model via cross-attention. Our approach preserves 3D semantics by avoiding 2D re-projection ambiguity under large pose and viewpoint changes, and produces photorealistic imagery that faithfully captures identity and appearance. To train and evaluate at scale, we also implement GenHumanRF, a…
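The conditioning path described in the abstract (3D landmarks and camera extrinsics turned into tokens, then injected through cross-attention) can be illustrated with a minimal NumPy sketch. Everything specific here is an assumption for illustration and not taken from the paper: the function names `build_condition_tokens` and `cross_attention`, the token width, the number of landmarks, the single camera token, and the use of plain linear projections with random weights. In particular, the abstract speaks of discrete tokens, whereas this sketch simply gives each landmark its own continuous embedding.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def build_condition_tokens(landmarks, extrinsics, w_lmk, w_cam):
    """Hypothetical tokenizer (an assumption, not the paper's design).

    landmarks:  (K, 3) sparse 3D body landmarks
    extrinsics: (3, 4) camera matrix [R | t]
    returns:    (K + 1, d) one token per landmark plus one camera token
    """
    lmk_tokens = landmarks @ w_lmk                  # (K, d)
    cam_token = extrinsics.reshape(1, 12) @ w_cam   # (1, d)
    return np.concatenate([lmk_tokens, cam_token], axis=0)

def cross_attention(latents, tokens, w_q, w_k, w_v):
    """Single-head cross-attention: image latents query the condition tokens."""
    q = latents @ w_q                               # (N, d)
    k = tokens @ w_k                                # (K + 1, d)
    v = tokens @ w_v                                # (K + 1, d)
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N, K + 1)
    return latents + attn @ v, attn                 # residual update

# Toy sizes, chosen arbitrarily for the example.
rng = np.random.default_rng(0)
d, K, N = 16, 17, 64
landmarks = rng.normal(size=(K, 3))
# Identity rotation, camera translated 2.5 units along z (made-up values).
extrinsics = np.hstack([np.eye(3), np.array([[0.0], [0.0], [2.5]])])
w_lmk = rng.normal(size=(3, d)) * 0.1
w_cam = rng.normal(size=(12, d)) * 0.1
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
latents = rng.normal(size=(N, d))                   # stand-in for denoiser features

tokens = build_condition_tokens(landmarks, extrinsics, w_lmk, w_cam)
out, attn = cross_attention(latents, tokens, w_q, w_k, w_v)
print(tokens.shape, out.shape)
```

Because the landmarks enter as 3D coordinates and are never rasterized into a 2D control image, two poses that would project to similar 2D skeletons still produce different tokens, which is the kind of re-projection ambiguity the abstract says the tokenized interface avoids.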

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis · Face recognition and analysis