DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers
Lizhen Wang, Zhurong Xia, Tianshu Hu, Pengrui Wang, Pengfei Wei, Zerong Zheng, Ming Zhou, Yuan Zhang, Mingyuan Gao

TL;DR
DreamActor-H1 introduces a diffusion transformer framework for generating high-fidelity human-product demonstration videos that accurately preserve identities, details, and spatial relationships, enhancing realism and consistency.
Contribution
The paper presents a novel diffusion transformer model that integrates paired reference information, motion guidance, and structured text encoding for realistic human-product video synthesis.
Findings
Outperforms state-of-the-art methods in identity preservation.
Generates realistic and consistent demonstration motions.
Effectively models human-product spatial relationships.
Abstract
In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text…
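The abstract names a masked cross-attention mechanism and product bounding boxes but does not spell out how they combine. As a rough illustration only, the NumPy sketch below assumes one plausible reading: video latent tokens act as queries, product reference tokens act as keys and values, and a boolean mask lets only the tokens inside the product bounding box attend to the product reference. The function names (`box_token_mask`, `masked_cross_attention`), the grid size, the box format, and the masking rule are all hypothetical and are not taken from the paper.

```python
import numpy as np

def box_token_mask(h, w, box):
    """Boolean (h*w,) mask of grid tokens inside box = (x0, y0, x1, y1), end-exclusive.
    Hypothetical helper: the box is given in latent-grid coordinates."""
    x0, y0, x1, y1 = box
    ys, xs = np.mgrid[0:h, 0:w]
    inside = (xs >= x0) & (xs < x1) & (ys >= y0) & (ys < y1)
    return inside.reshape(-1)

def masked_cross_attention(q, k, v, mask):
    """Single-head cross-attention with a boolean mask.
    q: (Nq, d), k: (Nk, d), v: (Nk, dv), mask: (Nq, Nk), True = query may attend to key.
    Queries with no allowed key receive a zero output (an assumption of this sketch)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)
    row_has_key = mask.any(axis=-1, keepdims=True)
    # Replace fully masked rows with zeros so the softmax below never sees an all -inf row.
    safe = np.where(row_has_key, scores, 0.0)
    safe = safe - safe.max(axis=-1, keepdims=True)
    weights = np.where(mask, np.exp(safe), 0.0)
    denom = weights.sum(axis=-1, keepdims=True)
    weights = weights / np.where(denom == 0.0, 1.0, denom)
    return weights @ v, weights

# Toy setup: a 4x4 latent grid, 5 product reference tokens, feature size 8.
rng = np.random.default_rng(0)
h, w, n_ref, d = 4, 4, 5, 8
q = rng.normal(size=(h * w, d))      # video latent tokens (queries)
k = rng.normal(size=(n_ref, d))      # product reference tokens (keys)
v = rng.normal(size=(n_ref, d))      # product reference tokens (values)

inside = box_token_mask(h, w, box=(1, 1, 3, 3))           # 2x2 region of the grid
mask = inside[:, None] & np.ones((1, n_ref), dtype=bool)  # (16, 5)
out, weights = masked_cross_attention(q, k, v, mask)
```

In this toy setup, tokens outside the box receive a zero update from the product reference, while tokens inside the box get a normalized weighted sum of the reference values. A real model would add learned projections, multiple heads, and a residual connection.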
Taxonomy
Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · Face recognition and analysis
Methods: Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Label Smoothing · Softmax · Dropout · Dense Connections · Transformer · Diffusion
