TL;DR
This paper introduces a self-refining method for video sampling that iteratively improves generated videos at inference time using the generator itself as a denoising autoencoder, enhancing motion realism without extra training.
Contribution
It proposes a novel self-refining inference technique for video generators, including an uncertainty-aware refinement strategy to improve physical realism and motion coherence.
Findings
Achieves a human preference rate above 70% against baseline samplers.
Significantly improves motion coherence and physics alignment in generated videos.
Demonstrates effectiveness across state-of-the-art video generators.
Abstract
Modern video generators still struggle with complex physical dynamics, often falling short of physical realism. Existing approaches address this using external verifiers or additional training on augmented data, which is computationally expensive and still limited in capturing fine-grained motion. In this work, we present self-refining video sampling, a simple method that uses a pre-trained video generator trained on large-scale datasets as its own self-refiner. By interpreting the generator as a denoising autoencoder, we enable iterative inner-loop refinement at inference time without any external verifier or additional training. We further introduce an uncertainty-aware refinement strategy that selectively refines regions based on self-consistency, which prevents artifacts caused by over-refinement. Experiments on state-of-the-art video generators demonstrate significant improvements…
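The inner-loop refinement described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: `toy_denoiser`, the additive re-noising step, the threshold `tau`, and the use of disagreement between successive predictions as the self-consistency signal are all assumptions made for illustration. A real setup would call the pre-trained video generator in place of `toy_denoiser`.

```python
import numpy as np

def toy_denoiser(x_noisy, sigma):
    """Hypothetical stand-in for a pre-trained generator used as a denoiser.

    Averages each frame with its temporal neighbours (axis 0, wrapping around).
    `sigma` is accepted for interface parity but ignored by this toy.
    """
    prev_frame = np.roll(x_noisy, 1, axis=0)
    next_frame = np.roll(x_noisy, -1, axis=0)
    return (prev_frame + x_noisy + next_frame) / 3.0

def self_refine(x, denoiser, sigma, num_iters=3, tau=0.05, rng=None):
    """Iteratively re-noise and re-denoise an estimate, updating only regions
    where successive predictions disagree (assumed uncertainty criterion)."""
    rng = np.random.default_rng(0) if rng is None else rng
    x_hat = denoiser(x, sigma)  # initial clean estimate
    for _ in range(num_iters):
        # Perturb: add noise back to the current estimate (assumed additive form).
        x_noisy = x_hat + sigma * rng.standard_normal(x_hat.shape)
        # Predict: let the same denoiser produce a new clean estimate.
        x_new = denoiser(x_noisy, sigma)
        # Self-consistency: elementwise disagreement between the two estimates.
        disagreement = np.abs(x_new - x_hat)
        uncertain = disagreement > tau
        # Refine only uncertain regions; consistent regions are left untouched.
        x_hat = np.where(uncertain, x_new, x_hat)
    return x_hat

# Toy "video": 8 frames of 4x4 values.
video = np.random.default_rng(1).standard_normal((8, 4, 4))
refined = self_refine(video, toy_denoiser, sigma=0.1)
```

Setting `tau` to infinity disables refinement, so the result is just the initial denoised estimate. Lowering `tau` refines more regions, which in this sketch plays the role of the over-refinement risk the abstract mentions.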
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · 3D Shape Modeling and Analysis · Human Pose and Action Recognition
