M2SVid: End-to-End Inpainting and Refinement for Monocular-to-Stereo Video Conversion

Nina Shvetsova; Goutam Bhat; Prune Truong; Hilde Kuehne; Federico Tombari

arXiv:2505.16565 · cs.CV · November 26, 2025

TL;DR

This paper introduces M2SVid, an end-to-end deep learning model that converts monocular videos into stereo videos by inpainting and refining the right view, achieving superior quality and speed compared to previous methods.

Contribution

The paper presents a novel architecture extending Stable Video Diffusion for monocular-to-stereo conversion, with modifications for better inpainting and efficiency, trained end-to-end without iterative steps.

Findings

01 Outperforms previous state-of-the-art methods in quality.

02 Ranked best 2.6x more often than the second-place method in user studies.

03 Operates 6x faster than competing approaches.

Abstract

We tackle the problem of monocular-to-stereo video conversion and propose a novel architecture for inpainting and refinement of the warped right view obtained by depth-based reprojection of the input left view. We extend the Stable Video Diffusion (SVD) model to utilize the input left video, the warped right video, and the disocclusion masks as conditioning input to generate a high-quality right camera view. In order to effectively exploit information from neighboring frames for inpainting, we modify the attention layers in SVD to compute full attention for disoccluded pixels. Our model is trained to generate the right view video in an end-to-end manner without iterative diffusion steps by minimizing image space losses to ensure high-quality generation. Our approach outperforms previous state-of-the-art methods, being ranked best 2.6x more often than the second-place method in a user…
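
The abstract does not spell out how the warped right view and the disocclusion masks are computed, so the following is only a minimal sketch of generic depth-based reprojection for a rectified stereo pair, not the authors' implementation. It assumes a per-pixel horizontal disparity map is already available (for example, derived from estimated depth), and that a pixel at column x in the left view lands at column x - d in the right view. The function name `warp_left_to_right` and the nearest-pixel z-buffer strategy are illustrative assumptions.

```python
import numpy as np


def warp_left_to_right(left, disparity):
    """Forward-warp one left frame to the right view (illustrative sketch).

    left:      (H, W, C) array, the left frame.
    disparity: (H, W) array of non-negative horizontal disparities in pixels
               (assumed given; larger disparity means closer to the camera).

    Returns (warped, mask): the warped right frame and a boolean
    disocclusion mask that is True where no left pixel landed.
    """
    h, w = disparity.shape
    warped = np.zeros_like(left)
    # Z-buffer on disparity: when two source pixels hit the same target,
    # keep the nearer one (the one with the larger disparity).
    zbuf = np.full((h, w), -np.inf)
    for y in range(h):
        for x in range(w):
            d = float(disparity[y, x])
            xt = int(round(x - d))  # target column in the right view
            if 0 <= xt < w and d > zbuf[y, xt]:
                zbuf[y, xt] = d
                warped[y, xt] = left[y, x]
    # Targets never written to are disoccluded: they are visible from the
    # right camera but hidden in the left view, so they must be inpainted.
    mask = np.isinf(zbuf)
    return warped, mask
```

In a pipeline like the one the abstract describes, the holes flagged by `mask` are the regions an inpainting model would have to fill, with the left frame, the warped frame, and the mask supplied together as conditioning.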

Taxonomy

Methods: Softmax · Attention Is All You Need · Inpainting · Diffusion