Text2Stereo: Repurposing Stable Diffusion for Stereo Generation with Consistency Rewards

Aakash Garg; Libing Zeng; Andrii Tsarov; Nima Khademi Kalantari

arXiv:2506.05367 · cs.CV · July 24, 2025

Open Access

TL;DR

This paper introduces Text2Stereo, a diffusion-based method that fine-tunes Stable Diffusion with consistency rewards to generate high-quality stereo images from text prompts, addressing dataset scarcity and improving stereo consistency.

Contribution

The paper presents a novel approach that fine-tunes Stable Diffusion with stereo consistency rewards to generate stereo images from text prompts.

Findings

01 Outperforms existing methods in stereo image quality

02 Achieves high stereo consistency and text alignment

03 Effective even with limited stereo datasets

Abstract

In this paper, we propose a novel diffusion-based approach to generate stereo images given a text prompt. Since stereo image datasets with large baselines are scarce, training a diffusion model from scratch is not feasible. Therefore, we propose leveraging the strong priors learned by Stable Diffusion and fine-tuning it on stereo image datasets to adapt it to the task of stereo generation. To improve stereo consistency and text-to-image alignment, we further tune the model using prompt alignment and our proposed stereo consistency reward functions. Comprehensive experiments demonstrate the superiority of our approach in generating high-quality stereo images across diverse scenarios, outperforming existing methods.
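
The abstract does not say how the stereo consistency reward is computed, so the following is only a generic illustration of what such a reward might measure, not the authors' formulation. It assumes a rectified grayscale pair and a known integer disparity map for the left view (both hypothetical inputs), warps the right view into the left view, and scores the pair by the negative mean photometric error. The names `warp_right_to_left` and `consistency_reward` are invented for this sketch.

```python
from typing import List, Optional

# Grayscale image stored as rows of intensities in [0, 1] (assumed layout).
Image = List[List[float]]
Disparity = List[List[int]]


def warp_right_to_left(right: Image, disparity: Disparity) -> List[List[Optional[float]]]:
    """Resample the right view onto the left view's pixel grid.

    For a rectified pair, a left-view pixel at column x with disparity d
    corresponds to column x - d in the right view. Pixels whose source
    column falls outside the image are marked None.
    """
    warped = []
    for row, disp_row in zip(right, disparity):
        width = len(row)
        out_row: List[Optional[float]] = []
        for x, d in enumerate(disp_row):
            src = x - d
            out_row.append(row[src] if 0 <= src < width else None)
        warped.append(out_row)
    return warped


def consistency_reward(left: Image, right: Image, disparity: Disparity) -> float:
    """Negative mean absolute error between the left view and the warped right view.

    A value of 0 means the two views agree perfectly on every valid pixel;
    more negative values mean larger disagreement.
    """
    warped = warp_right_to_left(right, disparity)
    errors = [
        abs(l - w)
        for l_row, w_row in zip(left, warped)
        for l, w in zip(l_row, w_row)
        if w is not None
    ]
    if not errors:
        raise ValueError("no valid overlapping pixels between the two views")
    return -sum(errors) / len(errors)


# Toy one-row example: the scene is shifted by one pixel between the views.
right_view = [[0.1, 0.2, 0.3, 0.4]]
disp = [[1, 1, 1, 1]]
consistent_left = [[0.9, 0.1, 0.2, 0.3]]    # matches right_view shifted by 1
inconsistent_left = [[0.9, 0.5, 0.5, 0.5]]  # does not match

good = consistency_reward(consistent_left, right_view, disp)
bad = consistency_reward(inconsistent_left, right_view, disp)
```

In this toy case the consistent pair scores higher than the inconsistent one. In a reward-based fine-tuning setup, a score of this kind would be computed on generated pairs and used to favor consistent outputs; how the paper does this, and how it combines the score with the prompt alignment reward, is not stated in the abstract.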

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Advanced Vision and Imaging · Multimodal Machine Learning Applications