APLA: Additional Perturbation for Latent Noise with Adversarial Training Enables Consistency

Yupu Yao; Shangqi Deng; Zihan Cao; Harry Zhang; Liang-Jian Deng

arXiv:2308.12605·cs.CV·May 3, 2024

TL;DR

This paper introduces APLA, a novel diffusion-based text-to-video generation method that enhances temporal consistency by extracting and refining inherent input information using an auxiliary transformer network.

Contribution

The paper proposes APLA, a diffusion-based framework that adds an auxiliary transformer network to improve temporal consistency in video generation from a single input video.

Findings

01. Significant improvement in video consistency, both qualitatively and quantitatively.

02. Effective extraction of inherent input information to refine pixel predictions.

03. Utilization of a hybrid transformer-convolution architecture for temporal modeling.

Abstract

Diffusion models have shown promising progress in video generation. However, they often struggle to keep details consistent within local regions across frames. One underlying cause is that traditional diffusion models approximate the Gaussian noise distribution with predicted noise, without fully accounting for the inherent information within the input itself. Additionally, these models emphasize the distinction between predictions and references, neglecting information intrinsic to the videos. To address this limitation, inspired by the self-attention mechanism, we propose a novel text-to-video (T2V) generation network structure based on diffusion models, dubbed Additional Perturbation for Latent noise with Adversarial training (APLA). Our approach requires only a single video as input and builds upon pre-trained Stable Diffusion networks. Notably, we introduce…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Machine Learning in Healthcare · Human Pose and Action Recognition

Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Softmax · Dropout · Position-Wise Feed-Forward Layer · Byte Pair Encoding · Adam · Layer Normalization · Dense Connections