TL;DR
DissolveStereo is a zero-shot framework that generates stereo videos with consistent depth and temporal coherence, using video diffusion priors and iterative latent-space refinement.
Contribution
It introduces a zero-shot stereo video generation method that leverages video diffusion priors without paired training data, combining a noisy restart strategy with dissolved depth maps to improve coherence across views and frames.
Findings
Achieves an 11.7% improvement in epipolar consistency (MEt3R score)
User studies show 8.0% higher perceived frame quality
User studies show 10.9% higher perceived temporal coherence
Abstract
Generating high-quality stereo videos requires consistent depth perception and temporal coherence across frames. Despite advances in image and video synthesis using diffusion models, producing high-quality stereo videos remains a challenging task due to the difficulty of maintaining consistent temporal and spatial coherence between left and right views. We introduce DissolveStereo, a novel framework for zero-shot stereo video generation that leverages video diffusion priors without requiring paired training data. Our key innovations include a noisy restart strategy to initialize stereo-aware latent representations and an iterative refinement process that progressively harmonizes the latent space, addressing issues like temporal flickering and view inconsistencies. Importantly, we propose the use of dissolved depth maps to streamline latent space operations by reducing high-frequency…
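The abstract does not spell out how the noisy restart works, so the following is only a minimal sketch of one common reading: re-noising an existing latent to an intermediate timestep with the standard forward diffusion step, x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps, so that a sampler can denoise again from there. The linear beta schedule, the function names `make_alpha_bar` and `noisy_restart`, the flat-list latent, and the choice of t = 600 are all illustrative assumptions, not details from the paper.

```python
import math
import random


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=2e-2):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    alpha_bar, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bar.append(running)
    return alpha_bar


def noisy_restart(latent, t, alpha_bar, rng):
    """Re-noise a latent to timestep t with the standard forward diffusion step."""
    signal = math.sqrt(alpha_bar[t])       # weight on the existing latent
    sigma = math.sqrt(1.0 - alpha_bar[t])  # weight on fresh Gaussian noise
    return [signal * x + sigma * rng.gauss(0.0, 1.0) for x in latent]


rng = random.Random(0)
alpha_bar = make_alpha_bar()

# Hypothetical flattened latent for one view; a real model would use a tensor.
latent = [rng.gauss(0.0, 1.0) for _ in range(16)]

# Restart from an intermediate noise level (t = 600 is an arbitrary choice).
restarted = noisy_restart(latent, t=600, alpha_bar=alpha_bar, rng=rng)
```

A larger `t` discards more of the original latent and gives the sampler more freedom to repair inconsistencies, while a smaller `t` preserves more of the initial structure. How DissolveStereo chooses this trade-off is not stated in the text above.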
