SVG: 3D Stereoscopic Video Generation via Denoising Frame Matrix

Peng Dai; Feitong Tan; Qiangeng Xu; David Futschik; Ruofei Du; Sean Fanello; Xiaojuan Qi; Yinda Zhang

arXiv:2407.00367·cs.CV·July 2, 2024

Open Access · 3 Reviews

TL;DR

This paper introduces a novel pose-free, training-free method for generating 3D stereoscopic videos from monocular videos using a frame matrix inpainting framework and depth estimation, without scene optimization or model fine-tuning.

Contribution

It presents a new approach that warps monocular videos into stereoscopic views and employs a frame matrix inpainting framework with disocclusion boundary re-injection, improving 3D stereoscopic video quality.
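The warping step can be illustrated with a minimal sketch of depth-based forward warping. This is not the authors' implementation: the function name `warp_to_right_view`, the `baseline` and `focal` parameters, and the purely horizontal pinhole shift are all assumptions made for illustration. It shows how a nearer object shifts further than the background and leaves disoccluded holes, which are the regions an inpainting stage would have to fill.

```python
import numpy as np

def warp_to_right_view(image, depth, baseline=0.06, focal=500.0):
    """Forward-warp a left-view image to a horizontally shifted camera.

    Hypothetical helper for illustration only. Assumes a rectified pinhole
    setup where disparity (in pixels) = baseline * focal / depth.
    Returns the warped image and a boolean mask of disoccluded pixels.
    """
    h, w = depth.shape
    disparity = baseline * focal / np.maximum(depth, 1e-6)
    # A point seen at column x in the left view lands at x - disparity.
    target_x = np.rint(np.arange(w)[None, :] - disparity).astype(int)

    warped = np.zeros_like(image)
    zbuf = np.full((h, w), np.inf)  # nearest depth written to each target pixel
    for y in range(h):
        for x in range(w):
            xt = target_x[y, x]
            # Keep the nearest surface when several pixels land on one target.
            if 0 <= xt < w and depth[y, x] < zbuf[y, xt]:
                zbuf[y, xt] = depth[y, x]
                warped[y, xt] = image[y, x]

    hole_mask = np.isinf(zbuf)  # target pixels that received no source pixel
    return warped, hole_mask

# Toy scene: a white foreground strip in front of a black background.
img = np.zeros((4, 8, 3), dtype=np.uint8)
img[:, 4:6] = 255
depth = np.full((4, 8), 10.0)
depth[:, 4:6] = 1.0
warped, holes = warp_to_right_view(img, depth, baseline=1.0, focal=2.0)
```

In this toy scene the foreground strip moves two pixels to the left, and the columns it vacated are marked in `holes`. The explicit loops favour clarity over speed; a practical version would be vectorised or run on the GPU.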

Findings

01

Significant improvement over previous methods in generating stereoscopic videos.

02

Effective in maintaining semantic coherence and consistency in generated videos.

03

Validated on videos from multiple generative models, including Sora, Lumiere, WALT, and Zeroscope.

Abstract

Video generation models have demonstrated great capabilities of producing impressive monocular videos; however, the generation of 3D stereoscopic video remains under-explored. We propose a pose-free and training-free approach for generating 3D stereoscopic videos using an off-the-shelf monocular video generation model. Our method warps a generated monocular video into camera views on a stereoscopic baseline using estimated video depth, and employs a novel frame matrix video inpainting framework. The framework leverages the video generation model to inpaint frames observed from different timestamps and views. This effective approach generates consistent and semantically coherent stereoscopic videos without scene optimization or model fine-tuning. Moreover, we develop a disocclusion boundary re-injection scheme that further improves the quality of video inpainting by alleviating the…
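The idea of inpainting frames "observed from different timestamps and views" can be pictured as a grid of frames indexed by view and time. The sketch below is a toy illustration under stated assumptions, not the paper's algorithm: `toy_denoise` is a hypothetical stand-in for one call to a video model (here it just blends frames toward the sequence mean), and the strict alternation between temporal and spatial passes is an assumed schedule.

```python
import numpy as np

def toy_denoise(seq):
    """Hypothetical placeholder for one denoising call on a frame sequence.

    seq has shape (N, H, W, C). A real system would call a video diffusion
    model here; this toy version blends each frame with the sequence mean.
    """
    return 0.5 * seq + 0.5 * seq.mean(axis=0, keepdims=True)

def denoise_frame_matrix(matrix, steps=4):
    """Alternate denoising over a (V, T, H, W, C) grid of V views by T timestamps.

    The alternation schedule is an assumption made for illustration.
    """
    m = matrix.astype(np.float64).copy()
    num_views, num_times = m.shape[:2]
    for step in range(steps):
        if step % 2 == 0:
            # Temporal sequences: one fixed view, all timestamps.
            for v in range(num_views):
                m[v] = toy_denoise(m[v])
        else:
            # Spatial sequences: one fixed timestamp, all views.
            for t in range(num_times):
                m[:, t] = toy_denoise(m[:, t])
    return m

rng = np.random.default_rng(0)
frames = rng.normal(size=(3, 5, 2, 2, 1))  # 3 views, 5 timestamps, 2x2 frames
result = denoise_frame_matrix(frames)
```

Because every frame takes part in both a temporal and a spatial sequence, information is shared across views and across time, which is the intuition behind the consistency claim. The toy operator only demonstrates the data layout and traversal order.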

Peer Reviews

Decision·ICLR 2025 Poster

Reviewer 01 · Rating 6 · Confidence 4

Strengths

1. The proposed method is pose-free and training-free. 2. This paper is clearly written and easy to understand. 3. Extensive experiments demonstrate the effectiveness of the proposed frame matrix and the disocclusion boundary re-injection scheme.

Weaknesses

1. As the model denoises along both temporal and spatial dimensions, one missing experiment is an investigation of varying the number of cameras between the left and right views. How does this variation impact the final quality and overall efficiency of the process? Is it feasible to use fewer internal camera views to save time? 2. Currently all the experiments have been conducted on synthesized videos. It would be beneficial to explore what the results look like when applied to real

Reviewer 02 · Rating 8 · Confidence 3

Strengths

- The paper is well-written and easy to follow. - The proposed frame matrix representation is reasonable, and the extensive experiments support its functionality. Therefore, this paper is sufficiently novel. - The proposed method demonstrates a strong understanding of the challenges specific to 3D video generation, including issues with depth estimation and video inpainting. - I believe that sufficient experiments and ablation studies are presented to support the approach.

Weaknesses

The main weaknesses of the proposed method are the disocclusion boundary artifacts, slightly lower temporal consistency compared to Deep3D, and the need for further improvements in holistic perceptual consistency, especially for certain subjects like human faces.

Reviewer 03 · Rating 6 · Confidence 5

Strengths

1. This paper proposes a training-free way to generate stereo video from monocular video and achieves SOTA performance in a training-free manner. 2. The proposed denoising frame matrix uses a pre-trained video generation model as the inpainting model, which is the first to do so in the stereo video generation field and offers insight into leveraging video generation models to assist this task.

Weaknesses

1. The analysis of the denoising frame matrix is insufficient. Although the authors provide the reason for using a video generation model, the theory and the high-level reason why it works are not clear. Please show a theoretical analysis of the proposed frame matrix. 2. Citations are lacking for methodology borrowed from other methods. The presentation in the paper contains certain misleading and deceptive elements. Lines 153-161, the viewpoint transfer part is widely used in novel view synthesis [R

Taxonomy

Topics: Advanced Vision and Imaging · Image and Video Stabilization · Computer Graphics and Visualization Techniques

Methods: Inpainting