Tora2: Motion and Appearance Customized Diffusion Transformer for Multi-Entity Video Generation

Zhenghao Zhang; Junchao Liao; Xiangyu Meng; Long Qin; Weizhi Wang

arXiv:2507.05963 · cs.CV · July 10, 2025

TL;DR

Tora2 is a novel diffusion transformer model that enables simultaneous customization of appearance and motion for multiple entities in video generation, improving detail preservation and multimodal alignment.

Contribution

Tora2 introduces a decoupled personalization extractor, a gated self-attention mechanism, and a contrastive loss, which together enable multi-entity appearance and motion customization in video synthesis.

Findings

01. Achieves performance competitive with state-of-the-art methods.

02. Enhances motion control capabilities in multi-entity video generation.

03. Improves visual detail preservation and multimodal alignment.

Abstract

Recent advances in diffusion transformer models for motion-guided video generation, such as Tora, have shown significant progress. In this paper, we present Tora2, an enhanced version of Tora, which introduces several design improvements to expand its capabilities in both appearance and motion customization. Specifically, we introduce a decoupled personalization extractor that generates comprehensive personalization embeddings for multiple open-set entities, better preserving fine-grained visual details compared to previous methods. Building on this, we design a gated self-attention mechanism to integrate trajectory, textual description, and visual information for each entity. This innovation significantly reduces misalignment in multimodal conditioning during training. Moreover, we introduce a contrastive loss that jointly optimizes trajectory dynamics and entity consistency through…
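The abstract does not spell out how the gated self-attention is computed, so the following is only a minimal sketch of one common formulation: a tanh-gated residual in the style popularized by GLIGEN, not a confirmed description of Tora2. It assumes that each entity contributes three condition tokens (trajectory, text, visual), that projections are identity matrices, and that a single scalar `gate` is learned. The names `self_attention`, `gated_self_attention`, `video_tokens`, `entity_tokens`, and `gate` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: query/key/value projections are the identity, so the
    example stays small. A real model would learn these projections.
    """
    dim = len(tokens[0])
    out = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        weights = softmax(scores)
        out.append(
            [sum(w * value[j] for w, value in zip(weights, tokens))
             for j in range(dim)]
        )
    return out


def gated_self_attention(video_tokens, entity_tokens, gate):
    """Fuse per-entity condition tokens into video tokens.

    Attention runs over the concatenation [video; entity]. Only the
    video positions are kept, and they are added back to the input
    scaled by tanh(gate). With gate == 0 the layer is an identity map,
    so conditioning can be introduced gradually during training.
    """
    joint = video_tokens + entity_tokens
    attended = self_attention(joint)[: len(video_tokens)]
    g = math.tanh(gate)
    return [
        [x + g * a for x, a in zip(token, update)]
        for token, update in zip(video_tokens, attended)
    ]


# Hypothetical toy inputs: two video tokens, and one entity described by
# three condition tokens (trajectory, text, visual), all of dimension 2.
video = [[1.0, 0.0], [0.0, 1.0]]
entity = [[0.5, 0.5], [1.0, 1.0], [0.0, 2.0]]
fused = gated_self_attention(video, entity, gate=1.0)
```

In this sketch the gate controls how strongly the entity conditions influence the video tokens. Initializing it at zero leaves the pretrained backbone's behavior unchanged at the start of fine-tuning, which is the usual motivation for gated designs of this kind.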

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Face recognition and analysis · Human Motion and Animation

Methods: Diffusion