SeedVR2: One-Step Video Restoration via Diffusion Adversarial Post-Training
Jianyi Wang, Shanchuan Lin, Zhijie Lin, Yuxi Ren, Meng Wei, Zongsheng Yue, Shangchen Zhou, Hao Chen, Yang Zhao, Ceyuan Yang, Xuefeng Xiao, Chen Change Loy, Lu Jiang

TL;DR
SeedVR2 is a one-step diffusion-based video restoration model that uses adaptive window attention and new training losses to match or outperform existing methods at lower inference cost.
Contribution
The paper presents SeedVR2, a one-step diffusion-based video restoration (VR) model that combines adaptive window attention with improved training techniques for high-resolution video.
Findings
Achieves performance comparable to or better than existing VR methods.
Restores video in a single sampling step, reducing computational cost.
Handles high-resolution video restoration by adapting the attention window size to the output resolution.
Abstract
Recent advances in diffusion-based video restoration (VR) demonstrate significant improvement in visual quality, yet incur a prohibitive computational cost during inference. While several distillation-based approaches have exhibited the potential of one-step image restoration, extending existing approaches to VR remains challenging and underexplored, particularly when dealing with high-resolution video in real-world settings. In this work, we propose a one-step diffusion-based VR model, termed SeedVR2, which performs adversarial VR training against real data. To handle the challenging high-resolution VR within a single step, we introduce several enhancements to both model architecture and training procedures. Specifically, an adaptive window attention mechanism is proposed, where the window size is dynamically adjusted to fit the output resolutions, avoiding window inconsistency…
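The adaptive window idea in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that the window size is chosen so that each axis of the feature map splits into a fixed number of windows. The function names and the default of 3 windows per axis are hypothetical and are not taken from the paper.

```python
import math


def adaptive_window_size(feat_h, feat_w, windows_h=3, windows_w=3):
    """Pick a window size from the feature-map size (hypothetical rule).

    Ceil division guarantees at most windows_h x windows_w windows,
    so the window grows with the resolution instead of staying fixed.
    """
    win_h = math.ceil(feat_h / windows_h)
    win_w = math.ceil(feat_w / windows_w)
    return win_h, win_w


def partition(feat_h, feat_w, win_h, win_w):
    """Return (top, left, height, width) for every window tiling the map.

    Windows on the bottom/right edge are clipped to the feature map.
    """
    windows = []
    for top in range(0, feat_h, win_h):
        for left in range(0, feat_w, win_w):
            height = min(win_h, feat_h - top)
            width = min(win_w, feat_w - left)
            windows.append((top, left, height, width))
    return windows


def smallest_window_area(windows):
    """Area of the smallest window, a rough proxy for edge-window mismatch."""
    return min(height * width for _, _, height, width in windows)
```

On a 90×160 feature map, a fixed 64×64 window leaves a 26×32 remnant in the corner, whereas the adaptive rule above yields nine windows of nearly equal size (30×54 or 30×52). Whether this matches the paper's exact rule is not stated in the excerpt above.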
Peer Reviews
Decision: ICLR 2026 Poster
- I think the jump to truly one-step VR with a diffusion transformer (initialized from SeedVR) plus APT is a meaningful step beyond prior one-step image restoration; prior works are mostly teacher-distillation or rely on fixed diffusion priors that cap quality. This work claims distillation-free adversarial post-training against real data after a lightweight progressive distillation stage to bridge the gap, which is interesting for video.
- The adaptive window attention to handle arbitrary res…

- I am concerned about the compute-heaviness. I think the approach relies heavily on significant compute (72×H100, 10M/5M pairs), which limits reproducibility in typical academic labs despite code release plans. Claims of “largest-ever VR GAN” underscore this.
- Scope of degradations. While synthetic degradations follow prior work, I think the paper could better characterize real-world degradation diversity and robustness (e.g., compression artifacts, rolling shutter, severe motion blur) bey…

- The introduction of adaptive window attention effectively reduces boundary artifacts when processing high-resolution frames.
- The training strategy, which combines RpGAN, approximate R2 regularization, feature-matching losses, and progressive distillation to ensure stable convergence and high perceptual quality, is comprehensive.
- The experiments are extensive and include both synthetic and real-world data, multiple objective and perceptual metrics, as well as a well-organized user study.

- My main concern is that the novelty of the method is somewhat limited, as it largely builds upon the existing Adversarial Post-Training (APT) framework, and the paper does not clearly explain the fundamental differences or new contributions beyond APT.
- The training process is extremely resource-intensive, requiring 72 H100 GPUs, which significantly limits reproducibility and practical accessibility.
- The method’s robustness under challenging conditions, such as heavy degradations, large m…

1. The paper introduces a novel one-step VR method by applying APT to diffusion-based models, reducing the computational burden significantly compared to traditional multi-step approaches.
2. The adaptive window attention mechanism for handling high-resolution videos and the feature matching loss for training stability are key contributions that improve the model's performance and robustness across varying video resolutions.
3. The method shows promising quantitative and qualitative results, o…

1. The paper lacks comparisons with the latest VSR methods presented at NeurIPS 2025 (such as DLoraL [1] and DOVE [2]). The authors should include comparisons with these methods to better demonstrate the competitiveness of the proposed approach.
2. The paper does not provide results trained on public datasets (such as REDS). The reported improvements might stem from using a larger private dataset. Will the authors make the dataset publicly available?
3. Despite achieving faster inference, the…
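Two of the training terms named in the reviews, the RpGAN objective and the feature-matching loss, can be sketched on plain Python lists. This is a minimal illustration under stated assumptions: the relativistic pairing form below follows the commonly published RpGAN formulation, the scalar "logits" and "features" stand in for discriminator outputs, and the function names are hypothetical. The approximate R2 regularizer and the weighting between terms are omitted because the excerpts above do not specify them.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def rpgan_losses(real_logits, fake_logits):
    """Relativistic pairing GAN losses on paired discriminator logits.

    Each generated sample is scored relative to a paired real sample:
    the generator wants D(fake) - D(real) to be large, the discriminator
    wants it to be small.
    """
    diffs = [fake - real for real, fake in zip(real_logits, fake_logits)]
    g_loss = sum(softplus(-d) for d in diffs) / len(diffs)
    d_loss = sum(softplus(d) for d in diffs) / len(diffs)
    return g_loss, d_loss


def feature_matching_loss(real_feats, fake_feats):
    """Mean squared distance between discriminator features (assumed L2 form)."""
    squared = [(r - f) ** 2 for r, f in zip(real_feats, fake_feats)]
    return sum(squared) / len(squared)
```

When the discriminator cannot tell the pair apart (equal logits), both adversarial losses equal log 2; the feature-matching term is zero only when the generated features coincide with the real ones.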
Code & Models
- 🤗 ByteDance-Seed/SeedVR2-3B · model · 2.2k downloads · ♡ 102
- 🤗 ByteDance-Seed/SeedVR2-7B · model · 6.4k downloads · ♡ 115
- 🤗 ByteDance-Seed/SeedVR-7B · model · 56 downloads · ♡ 8
- 🤗 ByteDance-Seed/SeedVR-3B · model · 76 downloads · ♡ 4
- 🤗 SassyDiffusion/SeedVR2-7B_FP32 · model · ♡ 4
- 🤗 SassyDiffusion/SeedVR2-7B_BF16 · model · ♡ 3
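The checkpoints listed above can be fetched with the `huggingface_hub` client. The sketch below only downloads repository files; the `checkpoints/` directory layout and the helper names are assumptions, and nothing here reflects how the official inference code expects weights to be arranged.

```python
# Repository ids copied from the list above.
SEEDVR2_REPOS = [
    "ByteDance-Seed/SeedVR2-3B",
    "ByteDance-Seed/SeedVR2-7B",
]


def local_dir_for(repo_id, root="checkpoints"):
    """Map a repo id to a local folder (hypothetical layout)."""
    return f"{root}/{repo_id.split('/')[-1]}"


def download_checkpoint(repo_id, root="checkpoints"):
    """Download every file of a Hub model repository and return its local path."""
    # Imported lazily so this module loads even if huggingface_hub is missing.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=repo_id, local_dir=local_dir_for(repo_id, root))
```

Calling `download_checkpoint(SEEDVR2_REPOS[0])` requires network access and enough disk space for the full repository.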
Taxonomy
Topics: Advanced Image Processing Techniques · Generative Adversarial Networks and Image Synthesis · Adversarial Robustness in Machine Learning
Methods: Softmax · Attention Is All You Need
