FullDiT2: Efficient In-Context Conditioning for Video Diffusion Transformers

Xuanhua He; Quande Liu; Zixuan Ye; Weicai Ye; Qiulin Wang; Xintao Wang; Qifeng Chen; Pengfei Wan; Di Zhang; Kun Gai

arXiv:2506.04213 · cs.CV · June 6, 2025

TL;DR

FullDiT2 introduces an efficient in-context conditioning framework for video diffusion transformers, significantly reducing computation and increasing speed while maintaining or improving video generation quality.

Contribution

FullDiT2 proposes dynamic token selection and selective context caching mechanisms to reduce redundancy in video diffusion models that use in-context conditioning.

Findings

1. Achieves a 2-3x speedup per diffusion step
2. Reduces computation without sacrificing video quality
3. Demonstrates effectiveness across six diverse tasks

Abstract

Fine-grained and efficient controllability of video diffusion transformers is increasingly in demand for practical applications. Recently, in-context conditioning has emerged as a powerful paradigm for unified conditional video generation: it enables diverse controls by concatenating varying context conditioning signals with noisy video latents into one long unified token sequence and processing them jointly with full attention, as in FullDiT. Despite their effectiveness, these methods incur quadratic computation overhead as task complexity increases, which hinders practical deployment. In this paper, we study the efficiency bottleneck neglected in the original in-context conditioning video generation framework. We begin with a systematic analysis that identifies two key sources of computational inefficiency: the inherent redundancy within context condition tokens and the computational redundancy in…
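The quadratic overhead mentioned in the abstract can be illustrated with a small sketch. It only counts the query-key pairs that full self-attention scores over a concatenated sequence of context and video tokens; it is not the paper's method or cost model. The function names and the token counts are hypothetical and chosen purely for illustration.

```python
def attention_pair_count(num_tokens: int) -> int:
    """Number of query-key pairs scored by full self-attention over a sequence."""
    return num_tokens * num_tokens


def relative_attention_cost(num_video_tokens: int, num_context_tokens: int) -> float:
    """Pair count for the concatenated [context; video] sequence,
    relative to attending over the video tokens alone."""
    total_tokens = num_video_tokens + num_context_tokens
    return attention_pair_count(total_tokens) / attention_pair_count(num_video_tokens)


# Hypothetical token counts (assumptions, not figures from the paper).
video_tokens = 1000
for context_tokens in (0, 1000, 2000):
    ratio = relative_attention_cost(video_tokens, context_tokens)
    print(f"context={context_tokens}: {ratio:.1f}x the video-only attention cost")
```

Under these assumed counts, adding as many context tokens as video tokens quadruples the number of attention pairs, and adding twice as many raises it ninefold. This growth is why reducing or reusing context tokens is an attractive place to look for savings.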

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Image Enhancement Techniques · Visual Attention and Saliency Detection

Methods: Diffusion