Improved Adversarial Diffusion Compression for Real-World Video Super-Resolution
Bin Chen, Weiqi Li, Shijie Zhao, Xuanyu Zhang, Junlin Li, Li Zhang, Jian Zhang

TL;DR
This paper introduces an improved adversarial diffusion compression (ADC) method for real-world video super-resolution. Using distillation and a dual-head adversarial scheme, it cuts model complexity by 95% and runs 8× faster than the teacher model while keeping quality competitive.
Contribution
It proposes a new ADC approach that distills a large diffusion Transformer into a lightweight model with temporal awareness and dual-head adversarial training, enhancing efficiency and detail preservation.
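As a rough illustration of what a dual-head adversarial objective can look like, the sketch below combines the scores of two discriminator heads, one judging spatial detail and one judging temporal consistency, into a single loss for the student. Everything in it is an assumption made for illustration: the names `softplus` and `generator_adv_loss`, the non-saturating logistic loss, and the weights `w_spatial` and `w_temporal` are hypothetical and are not taken from the paper.

```python
import math

def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def generator_adv_loss(spatial_logits, temporal_logits,
                       w_spatial=1.0, w_temporal=1.0):
    """Non-saturating GAN loss for the student, summed over two heads.

    spatial_logits:  hypothetical per-frame scores from a detail head
    temporal_logits: hypothetical per-clip scores from a consistency head
    Each head contributes mean(softplus(-logit)), which is small when
    that head believes the student's output is real.
    """
    l_s = sum(softplus(-z) for z in spatial_logits) / len(spatial_logits)
    l_t = sum(softplus(-z) for z in temporal_logits) / len(temporal_logits)
    return w_spatial * l_s + w_temporal * l_t
```

In a setup like this, raising `w_temporal` relative to `w_spatial` would push the student toward smoother motion at the possible cost of per-frame detail. How the paper actually balances its two heads is not stated in this summary.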
Findings
Model complexity reduced by 95%.
Achieves 8× faster inference than the teacher model.
Maintains competitive video super-resolution quality.
Abstract
While many diffusion models have achieved impressive results in real-world video super-resolution (Real-VSR) by generating rich and realistic details, their reliance on multi-step sampling leads to slow inference. One-step networks like SeedVR2, DOVE, and DLoRAL alleviate this through condensing generation into one single step, yet they remain heavy, with billions of parameters and multi-second latency. Recent adversarial diffusion compression (ADC) offers a promising path via pruning and distilling these models into a compact AdcSR network, but directly applying it to Real-VSR fails to balance spatial details and temporal consistency due to its lack of temporal awareness and the limitations of standard adversarial learning. To address these challenges, we propose an improved ADC method for Real-VSR. Our approach distills a large diffusion Transformer (DiT) teacher DOVE equipped with 3D…
Peer Reviews
Decision: ICLR 2026 Poster
1. This paper proposes a “2D spatial + 1D temporal” decoupling hypothesis and introduces a novel lightweight “2D+1D” architecture. This approach drastically cuts parameters and computational load, enabling efficient inference. 2. The paper proposes a novel dual-head adversarial distillation scheme. This scheme effectively balances the richness of spatial details with temporal coherence, which is a critical challenge in video super-resolution.
1. A core assumption of this paper is that a 2D diffusion model is sufficient for synthesizing fine-grained details. However, this assumption is challenged by the experimental results. Currently, the ablation study comparing the 2D and 3D backbones is based only on the DISTS metric. To provide a more balanced and convincing comparison, the authors should consider including additional metrics that measure perceptual quality (e.g., LPIPS) and/or fidelity (e.g., PSNR, SSIM). Furthermore, the qualit
1. The proposed dual-head discriminator effectively disentangles spatial detail enhancement and temporal consistency, addressing a long-standing trade-off in Real-VSR. 2. The results on multiple datasets and metrics (PSNR, LPIPS, MUSIQ, MANIQA, etc.) are convincing and show both efficiency and quality improvements. The visual quality is also satisfactory. 3. The 2D+1D design combined with adversarial distillation is simple yet efficient, offering clear insights into practical diffusion mod
The paper lacks formal justification for why the dual-head adversarial loss leads to better convergence or perceptual trade-off control.
Originality: The novel "2D + 1D" architecture design, combined with the dual-head discriminator adversarial distillation strategy, effectively decouples the optimization objectives for detail and consistency, demonstrating strong innovation. Quality: Comprehensive experimental designs, including extensive validation on multiple synthetic and real-world datasets, support the effectiveness of the method through both quantitative and qualitative results. Ablation studies also thoroughly verify the
Insufficient Comparative Experiments: Although comparisons are made with several SOTA methods, there is a lack of comparison with recent non-diffusion-based efficient VSR approaches, such as those based on CNNs or lightweight Transformers. Limited Generalization Validation: All experiments are conducted at a fixed resolution (512×512) and frame length (25 frames), without demonstrating performance on longer videos or higher resolutions. Weak Theoretical Support for Dual-Head Discriminator Desig
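The reviews above describe a "2D spatial + 1D temporal" decoupling. To give a feel for why that kind of factorization can shrink a model, the sketch below counts the weights in one full spatiotemporal convolution and in a spatial convolution followed by a temporal one. This is a generic factorized-convolution illustration, not the paper's architecture: the helper names, the channel width of 64, and the kernel size of 3 are assumptions, and this toy count is not meant to reproduce the reported 95% reduction.

```python
def conv3d_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a full k x k x k spatiotemporal convolution (no bias)."""
    return c_in * c_out * k ** 3

def conv2d_plus_1d_params(c_in: int, c_out: int, k: int) -> int:
    """Weights in a k x k spatial conv followed by a length-k temporal conv."""
    spatial = c_in * c_out * k ** 2   # applied to each frame independently
    temporal = c_out * c_out * k      # applied along time at each pixel
    return spatial + temporal

full = conv3d_params(64, 64, 3)              # 64 * 64 * 27 = 110592
factored = conv2d_plus_1d_params(64, 64, 3)  # 64 * 64 * (9 + 3) = 49152
print(full, factored, factored / full)
```

With these assumed sizes the factored layer needs fewer than half the weights of the full one, and the gap widens as the kernel size grows.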
Taxonomy
Topics: Advanced Image Processing Techniques · Generative Adversarial Networks and Image Synthesis · Image and Video Quality Assessment
