DreamActor-H1: High-Fidelity Human-Product Demonstration Video Generation via Motion-designed Diffusion Transformers

Lizhen Wang; Zhurong Xia; Tianshu Hu; Pengrui Wang; Pengfei Wei; Zerong Zheng; Ming Zhou; Yuan Zhang; Mingyuan Gao

arXiv:2506.10568 · cs.CV · August 28, 2025

Open Access

TL;DR

DreamActor-H1 introduces a diffusion transformer framework for generating high-fidelity human-product demonstration videos. It preserves human identity, product details such as logos and textures, and human-product spatial relationships, which improves realism and consistency.

Contribution

The paper presents a novel diffusion transformer model that integrates paired reference information, motion guidance, and structured text encoding for realistic human-product video synthesis.

Findings

01. Outperforms state-of-the-art methods in identity preservation.

02. Generates realistic and consistent demonstration motions.

03. Effectively models human-product spatial relationships.

Abstract

In e-commerce and digital marketing, generating high-fidelity human-product demonstration videos is important for effective product presentation. However, most existing frameworks either fail to preserve the identities of both humans and products or lack an understanding of human-product spatial relationships, leading to unrealistic representations and unnatural interactions. To address these challenges, we propose a Diffusion Transformer (DiT)-based framework. Our method simultaneously preserves human identities and product-specific details, such as logos and textures, by injecting paired human-product reference information and utilizing an additional masked cross-attention mechanism. We employ a 3D body mesh template and product bounding boxes to provide precise motion guidance, enabling intuitive alignment of hand gestures with product placements. Additionally, structured text…
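The abstract mentions a masked cross-attention mechanism for injecting reference information but, as summarized here, does not specify its form. The sketch below is a generic single-head masked cross-attention in plain Python, offered only as an illustration of the general technique and not as the authors' implementation. The function name `masked_cross_attention`, the token shapes, and the idea that the mask marks which video tokens may attend to which product reference tokens are all assumptions made for this example.

```python
import math

def masked_cross_attention(queries, keys, values, mask):
    """Generic single-head cross-attention with a boolean mask (illustrative only).

    queries: list of Nq vectors of length d (assumed here: video tokens)
    keys:    list of Nk vectors of length d (assumed here: reference tokens)
    values:  list of Nk vectors of length dv
    mask:    Nq x Nk booleans; mask[i][j] is True when query i may attend to key j
    Returns Nq output vectors of length dv. A query whose mask row is
    entirely False receives a zero vector.
    """
    d = len(queries[0])
    dv = len(values[0])
    outputs = []
    for q, row in zip(queries, mask):
        allowed = [j for j, ok in enumerate(row) if ok]
        if not allowed:
            outputs.append([0.0] * dv)
            continue
        # Scaled dot-product scores over the allowed keys only.
        scores = [sum(a * b for a, b in zip(q, keys[j])) / math.sqrt(d)
                  for j in allowed]
        # Numerically stable softmax.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out = [0.0] * dv
        for w, j in zip(weights, allowed):
            for c in range(dv):
                out[c] += (w / total) * values[j][c]
        outputs.append(out)
    return outputs

# Hypothetical toy data: three video tokens, two product reference tokens.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[1.0, 0.0], [0.0, 1.0]]
# Token 0 sees both references, token 1 sees only the first, token 2 sees none.
mask = [[True, True], [True, False], [False, False]]
out = masked_cross_attention(queries, keys, values, mask)
```

In this toy setup, a token that is masked off from every reference token receives no reference signal at all, which is one plausible way to keep product details from leaking into regions outside the product.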

Taxonomy

Topics: Human Motion and Animation · Generative Adversarial Networks and Image Synthesis · Face recognition and analysis

Methods: Absolute Position Encodings · Layer Normalization · Byte Pair Encoding · Label Smoothing · Softmax · Dropout · Dense Connections · Transformer · Diffusion