TL;DR
Vivid-VR introduces a novel video restoration method that distills the conceptual understanding of a text-to-video diffusion model to improve texture realism and temporal coherence in photorealistic videos.
Contribution
It proposes a concept distillation training strategy and a redesigned control architecture to improve controllability and output quality in video restoration.
Findings
Outperforms existing methods on synthetic and real-world benchmarks.
Achieves high texture realism and temporal consistency.
Demonstrates effectiveness on AIGC videos.
Abstract
We present Vivid-VR, a DiT-based generative video restoration method built upon an advanced T2V foundation model, where ControlNet is leveraged to control the generation process, ensuring content consistency. However, conventional fine-tuning of such controllable pipelines frequently suffers from distribution drift due to limitations in imperfect multimodal alignment, resulting in compromised texture realism and temporal coherence. To tackle this challenge, we propose a concept distillation training strategy that utilizes the pretrained T2V model to synthesize training samples with embedded textual concepts, thereby distilling its conceptual understanding to preserve texture and temporal quality. To enhance generation controllability, we redesign the control architecture with two key components: 1) a control feature projector that filters degradation artifacts from input video latents…
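To make the concept distillation idea concrete, here is a minimal toy sketch of how a training sample might be synthesized. It assumes the corrupt-and-reconstruct reading of the strategy: a source video latent is noised, then denoised by the frozen T2V model conditioned on a caption, so that the resulting video matches the model's own understanding of the text. Every name here (`add_noise`, `toy_t2v_denoise`, `synthesize_training_sample`), the noise level `t`, and the step count are hypothetical, and the denoiser is a stand-in rather than a real T2V model.

```python
import math
import random


def add_noise(latent, t, rng):
    """Forward-diffuse a latent to noise level t in [0, 1] (variance-preserving mix)."""
    keep = math.sqrt(1.0 - t)
    spread = math.sqrt(t)
    return [keep * x + spread * rng.gauss(0.0, 1.0) for x in latent]


def toy_t2v_denoise(noisy_latent, caption, steps):
    """Hypothetical stand-in for a frozen, text-conditioned T2V denoiser.

    A real model would run its sampler here. This toy version only pulls the
    latent toward a caption-dependent target so the example stays runnable.
    """
    target = (len(caption) % 7) / 7.0
    latent = list(noisy_latent)
    for _ in range(steps):
        latent = [0.5 * x + 0.5 * target for x in latent]
    return latent


def synthesize_training_sample(source_latent, caption, t=0.5, steps=4, seed=0):
    """Build one (caption, target) pair whose target was produced by the T2V model."""
    rng = random.Random(seed)
    noisy = add_noise(source_latent, t, rng)              # corrupt the source latent
    synthesized = toy_t2v_denoise(noisy, caption, steps)  # reconstruct under the caption
    return {"caption": caption, "target_latent": synthesized}


sample = synthesize_training_sample([0.1, 0.2, 0.3], "a dog running on grass")
print(sample["caption"], len(sample["target_latent"]))
```

In this reading, the restoration model would then be trained on such pairs, so its targets stay within the T2V model's output distribution. How the actual method chooses noise levels and captions is not specified in the text above.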
Peer Reviews
Decision: ICLR 2026 Poster
1. It proposes concept distillation with a pre-trained T2V model to generate aligned text–video pairs. 2. It markedly outperforms prior methods. 3. A high-quality dataset is created that should significantly benefit the video-generation community.
Lack of video supplementary results: As a video-oriented work, without video results as supplementary material, it is difficult for the public to intuitively evaluate the model's performance, especially the quality of temporal consistency.
1. Leveraging the capabilities of pretrained T2V models to enhance the video restoration performance is an interesting approach. Using the textual description as a connector, the authors find an effective way to transfer the T2V model’s pretrained knowledge to the video restoration model, which I believe benefits the community. 2. The proposed method outperforms several previous advances by a large margin in a wide range of benchmarks. The experiments are comprehensive, and the qualitative demo
1. Though the idea of transferring the T2V model's capability to downstream tasks is interesting, the proposed method seems to be trivial and similar to other methods. Using a pretrained model to corrupt and reconstruct the visual content is a common way, especially in image enhancement and restoration tasks, and it is also widely adopted to add the textual description during the reconstruction. It is more likely to be a transition from the image restoration task to the video restoration task, w
The method demonstrates strong empirical performance according to the reported evaluations.
Technical novelty and depth are limited. The architectural modifications, while effective, are relatively standard and offer limited insight; it is unlikely that these design choices will significantly influence future research. Showing that synthesized videos can benefit video restoration is a useful observation, but this point alone, at the current level of investigation, does not meet the novelty threshold to me. The proposed components are not sufficiently analyzed. For example, although th
