HunyuanCustom: A Multimodal-Driven Architecture for Customized Video Generation

Teng Hu; Zhentao Yu; Zhengguang Zhou; Sen Liang; Yuan Zhou; Qin Lin; Qinglin Lu

arXiv:2505.04512·cs.CV·May 9, 2025

TL;DR

HunyuanCustom is a multimodal framework for customized video generation that maintains subject identity and supports various input modalities, significantly improving realism and alignment over existing methods.

Contribution

The paper introduces HunyuanCustom, a novel multimodal architecture that enhances identity consistency and multi-modal input support in customized video generation.

Findings

01 Outperforms state-of-the-art methods in ID consistency and realism.

02 Supports image-, audio-, video-, and text-conditioned generation.

03 Demonstrates robustness across multiple downstream tasks.

Abstract

Customized video generation aims to produce videos featuring specific subjects under flexible user-defined conditions, yet existing methods often struggle with identity consistency and limited input modalities. In this paper, we propose HunyuanCustom, a multi-modal customized video generation framework that emphasizes subject consistency while supporting image, audio, video, and text conditions. Built upon HunyuanVideo, our model first addresses the image-text conditioned generation task by introducing a text-image fusion module based on LLaVA for enhanced multi-modal understanding, along with an image ID enhancement module that leverages temporal concatenation to reinforce identity features across frames. To enable audio- and video-conditioned generation, we further propose modality-specific condition injection mechanisms: an AudioNet module that achieves hierarchical alignment via…
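The temporal concatenation mentioned in the abstract can be pictured with a small sketch. This is not the authors' implementation: it only assumes that the identity image is encoded into a latent with the same shape as one video frame latent and is then placed on the time axis as an extra frame. The function name `concat_identity_along_time`, the flat-list latent representation, and the choice to prepend rather than append are all hypothetical.

```python
from typing import List, Sequence

Latent = List[float]  # hypothetical: one flattened latent vector per frame


def concat_identity_along_time(
    identity_latent: Sequence[float],
    video_latents: Sequence[Sequence[float]],
) -> List[Latent]:
    """Place the identity latent on the time axis as an extra frame.

    Assumption: the identity latent goes at time index 0 (prepended).
    The abstract only states that temporal concatenation is used.
    """
    for frame in video_latents:
        if len(frame) != len(identity_latent):
            raise ValueError("identity and frame latents must have the same size")
    # The result has T + 1 entries along the time axis.
    return [list(identity_latent)] + [list(frame) for frame in video_latents]


# Toy example: 4 video frames, each with a 3-dimensional latent.
identity = [1.0, 1.0, 1.0]
video = [[0.0, 0.0, 0.0] for _ in range(4)]
sequence = concat_identity_along_time(identity, video)
print(len(sequence))  # 5
```

Under this assumption, a model that attends over the whole time axis would see the identity latent alongside every generated frame, which is one plausible reading of how identity features are reinforced across frames.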

Code & Models

Repositories

tencent-hunyuan/hunyuancustom
pytorch

Models

tencent/HunyuanCustom
Hugging Face model

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Face recognition and analysis