# Phased One-Step Adversarial Equilibrium for Video Diffusion Models

**Authors:** Jiaxiang Cheng, Bing Ma, Xuhua Ren, Hongyi Henry Jin, Kai Yu, Peng Zhang, Wenyue Li, Yuan Zhou, Tianxiang Zheng, Qinglin Lu

arXiv: 2508.21019 · 2025-12-29

## TL;DR

This paper introduces V-PAE (Video Phased Adversarial Equilibrium), a two-phase distillation framework that enables high-quality, single-step generation from large-scale video diffusion models. It cuts diffusion latency by a reported 100 times while maintaining semantic alignment and temporal coherence.

## Contribution

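The paper proposes V-PAE, a two-phase distillation method (stability priming followed by unified adversarial equilibrium) that enables single-step video generation for large-scale models. It targets two gaps in prior acceleration techniques adapted from image models: unstable single-step distillation at scale and weak generalization to conditional downstream tasks.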

## Key findings

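- On VBench-I2V, V-PAE outperforms existing acceleration methods by an average of 5.8% in the overall quality score, which covers semantic alignment, temporal coherence, and frame quality.
- It reduces the diffusion latency of a large-scale video model (e.g., Wan2.1-I2V-14B) by 100 times while preserving competitive performance.
- In image-to-video (I2V) generation, it preserves video-image subject consistency despite the semantic degradation and conditional frame collapse that can occur during distillation training.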
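
To make the latency claim concrete, the sketch below counts backbone forward passes for a multi-step sampler versus a single-step student. It is only an illustration: the step count, the use of classifier-free guidance (CFG), and the function names `network_evaluations` and `evaluation_ratio` are assumptions, not values from the paper. This summary does not say how the 100 times figure was measured, and wall-clock latency also depends on factors this count ignores.

```python
def network_evaluations(sampling_steps: int, uses_cfg: bool) -> int:
    """Count backbone forward passes needed to draw one sample.

    Assumption: classifier-free guidance costs two passes per step
    (one conditional, one unconditional).
    """
    if sampling_steps < 1:
        raise ValueError("sampling_steps must be >= 1")
    return sampling_steps * (2 if uses_cfg else 1)


def evaluation_ratio(teacher_steps: int, teacher_cfg: bool,
                     student_steps: int = 1, student_cfg: bool = False) -> float:
    """Ratio of teacher to student forward passes (a proxy for speedup)."""
    teacher = network_evaluations(teacher_steps, teacher_cfg)
    student = network_evaluations(student_steps, student_cfg)
    return teacher / student


# Hypothetical settings chosen for illustration, not taken from the paper.
ratio = evaluation_ratio(teacher_steps=50, teacher_cfg=True)
print(ratio)  # 100.0
```

Under these assumed settings, a single guidance-free step replaces 100 forward passes, which shows how a reduction of two orders of magnitude can arise from step count alone.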

## Abstract

Video diffusion generation suffers from critical sampling efficiency bottlenecks, particularly for large-scale models and long contexts. Existing video acceleration methods, adapted from image-based techniques, lack single-step distillation ability for large-scale video models and task generalization for conditional downstream tasks. To bridge this gap, we propose the Video Phased Adversarial Equilibrium (V-PAE), a distillation framework that enables high-quality, single-step video generation from large-scale video models. Our approach employs a two-phase process. (i) Stability priming is a warm-up process that aligns the distributions of real and generated videos. It improves the stability of the single-step adversarial distillation that follows. (ii) Unified adversarial equilibrium is a flexible self-adversarial process that reuses generator parameters for the discriminator backbone. It achieves a co-evolutionary adversarial equilibrium in the Gaussian noise space. For conditional tasks, we focus on preserving video-image subject consistency, which would otherwise be lost to semantic degradation and conditional frame collapse during distillation training in image-to-video (I2V) generation. Comprehensive experiments on VBench-I2V demonstrate that V-PAE outperforms existing acceleration methods by an average of 5.8% in the overall quality score, including semantic alignment, temporal coherence, and frame quality. In addition, our approach reduces the diffusion latency of the large-scale video model (e.g., Wan2.1-I2V-14B) by 100 times, while preserving competitive performance.
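
The two structural ideas in the abstract, a phase switch from warm-up to adversarial training and a discriminator that reuses the generator's parameters as its backbone, can be sketched with a toy model. This is a minimal structural sketch, not the authors' implementation. Every name here (`Backbone`, `Generator`, `Discriminator`, `phase_for_step`, `priming_steps`) is hypothetical, the scalar "weight" stands in for a full video transformer, and sharing the backbone by reference is an assumption about what "reuses" means.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Backbone:
    """Toy stand-in for the large video diffusion backbone (hypothetical)."""
    weight: float = 1.0

    def features(self, x: List[float]) -> List[float]:
        return [self.weight * v for v in x]


@dataclass
class Generator:
    backbone: Backbone

    def one_step(self, noise: List[float]) -> List[float]:
        # A single forward pass maps noise to a (toy) video latent.
        return self.backbone.features(noise)


@dataclass
class Discriminator:
    backbone: Backbone        # assumption: the generator's backbone, shared by reference
    head_weight: float = 0.5  # small head owned by the discriminator alone

    def score(self, noisy_latent: List[float]) -> float:
        feats = self.backbone.features(noisy_latent)
        return self.head_weight * sum(feats) / len(feats)


def phase_for_step(step: int, priming_steps: int) -> str:
    """Phase (i) runs first as a warm-up; phase (ii) takes over afterwards."""
    if step < priming_steps:
        return "stability_priming"
    return "unified_adversarial_equilibrium"


backbone = Backbone(weight=1.0)
gen = Generator(backbone)
disc = Discriminator(backbone)

score_before = disc.score([1.0, 2.0, 3.0])
backbone.weight = 2.0  # simulate a generator update
score_after = disc.score([1.0, 2.0, 3.0])
print(score_before, score_after)  # 1.0 2.0
```

Because both objects hold the same `Backbone`, an update made on the generator side immediately changes the discriminator's features. That is one plausible reading of the "co-evolutionary" behaviour the abstract describes; the actual losses, noise schedule, and update rules are in the full paper.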

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/2508.21019/full.md

## Figures

17 figures with captions in the complete paper: https://tomesphere.com/paper/2508.21019/full.md

---
Source: https://tomesphere.com/paper/2508.21019