APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency
Yupu Yao, Shangqi Deng, Zihan Cao, Harry Zhang, Liang-Jian Deng

TL;DR
This paper introduces APLA, a novel diffusion-based text-to-video generation method that enhances temporal consistency by extracting and refining inherent input information using an auxiliary transformer network.
Contribution
The paper proposes APLA, a diffusion-based framework that adds an auxiliary transformer network to improve temporal consistency in video generation from a single input video.
Findings
Significant improvement in video consistency both qualitatively and quantitatively.
Effective extraction of inherent input information to refine pixel predictions.
Utilization of a hybrid transformer-convolution architecture for temporal modeling.
Abstract
Diffusion models have exhibited promising progress in video generation. However, they often struggle to retain consistent details within local regions across frames. One underlying cause is that traditional diffusion models approximate Gaussian noise distribution by utilizing predictive noise, without fully accounting for the impact of inherent information within the input itself. Additionally, these models emphasize the distinction between predictions and references, neglecting information intrinsic to the videos. To address this limitation, inspired by the self-attention mechanism, we propose a novel text-to-video (T2V) generation network structure based on diffusion models, dubbed Additional Perturbation for Latent noise with Adversarial training (APLA). Our approach only necessitates a single video as input and builds upon pre-trained stable diffusion networks. Notably, we introduce…
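The "predictive noise" the abstract refers to is the standard noise-prediction objective used by diffusion models: a clean latent is corrupted with Gaussian noise, and a network is trained to predict that noise. The sketch below illustrates only this baseline objective, not APLA itself; the abstract is truncated here, so the perturbation, auxiliary transformer, and adversarial training are not shown. All function names, the linear schedule, and the zero-output stand-in predictor are assumptions made for illustration.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Linearly spaced noise variances (an assumed, commonly used schedule)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def cumulative_alphas(betas):
    """alpha_bar[t] is the running product of (1 - beta) up to step t."""
    alpha_bar, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def add_noise(x0, t, alpha_bar, rng):
    """Forward process: x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * eps."""
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    signal = math.sqrt(alpha_bar[t])
    sigma = math.sqrt(1.0 - alpha_bar[t])
    x_t = [signal * x + sigma * e for x, e in zip(x0, eps)]
    return x_t, eps


def noise_prediction_loss(eps_true, eps_pred):
    """Mean squared error between the sampled noise and the predicted noise."""
    return sum((a - b) ** 2 for a, b in zip(eps_true, eps_pred)) / len(eps_true)


rng = random.Random(0)
alpha_bar = cumulative_alphas(linear_beta_schedule(1000))

# Hypothetical flattened latent for one video frame.
latent = [rng.gauss(0.0, 1.0) for _ in range(64)]
x_t, eps = add_noise(latent, 500, alpha_bar, rng)

# Stand-in for a trained denoiser: always predicts zero noise.
eps_pred = [0.0] * len(x_t)
loss = noise_prediction_loss(eps, eps_pred)
```

Because this loss compares only predicted and sampled noise, nothing in it ties one frame's prediction to another's, which is the gap the abstract describes.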
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Machine Learning in Healthcare · Human Pose and Action Recognition
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dropout · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Layer Normalization · Dense Connections
