Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image

Yu Zhao; Hao Fei; Xiangtai Li; Libo Qin; Jiayi Ji; Hongyuan Zhu; Meishan Zhang; Min Zhang; Jianguo Wei

arXiv:2410.15312 · cs.CV · September 3, 2025



TL;DR

This paper introduces a dual learning framework that combines a novel 3D scene graph with a diffusion process to improve spatial understanding in image-to-text and text-to-image tasks, outperforming mainstream T2I and I2T methods on the VSD dataset.

Contribution

The work proposes a dual learning approach with a shared 3D scene graph and a diffusion model to enhance spatial reasoning in image-text generation tasks.
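To make the idea of a shared representation concrete, the following is a minimal sketch of a scene graph that feeds both a text-side and an image-side consumer. It is not the paper's 3DSG, whose structure is not specified here. The names `Relation`, `SceneGraph`, `to_text`, and `layout_constraints` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Relation:
    """One spatial relation, e.g. (cup, on, table). Hypothetical structure."""
    subject: str
    predicate: str
    obj: str


@dataclass
class SceneGraph:
    """Illustrative scene graph: objects as nodes, spatial relations as edges."""
    objects: set = field(default_factory=set)
    relations: list = field(default_factory=list)

    def add_relation(self, subject: str, predicate: str, obj: str) -> None:
        # Register both endpoints as nodes, then store the edge.
        self.objects.update((subject, obj))
        self.relations.append(Relation(subject, predicate, obj))

    def to_text(self) -> str:
        # Text-side view: verbalize each relation as a simple sentence.
        if not self.relations:
            return ""
        sentences = [f"The {r.subject} is {r.predicate} the {r.obj}" for r in self.relations]
        return ". ".join(sentences) + "."

    def layout_constraints(self) -> list:
        # Image-side view: plain triples that a layout or image generator could consume.
        return [(r.subject, r.predicate, r.obj) for r in self.relations]


graph = SceneGraph()
graph.add_relation("cup", "on", "table")
graph.add_relation("chair", "behind", "table")
print(graph.to_text())
print(graph.layout_constraints())
```

The point of the sketch is only that a single structure can serve both directions: one view is verbalized as text, the other is passed on as spatial constraints.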

Findings

01 Outperforms mainstream T2I and I2T methods on the VSD dataset

02 Demonstrates the effectiveness of dual learning for spatial understanding

03 Provides an in-depth analysis of the advantages of the dual learning strategy

Abstract

In the visual spatial understanding (VSU) area, spatial image-to-text (SI2T) and spatial text-to-image (ST2I) are two fundamental tasks that appear in dual form. Existing methods for standalone SI2T or ST2I perform imperfectly in spatial understanding because 3D-wise spatial feature modeling is difficult. In this work, we model SI2T and ST2I together under a dual learning framework. Within the dual framework, we propose to represent the 3D spatial scene features with a novel 3D scene graph (3DSG) representation that can be shared by, and is beneficial to, both tasks. Further, inspired by the intuition that the easier 3D $\to$ image and 3D $\to$ text processes also exist symmetrically in ST2I and SI2T, respectively, we propose the Spatial Dual Discrete Diffusion (SD$^{3}$) framework, which utilizes the intermediate features of the 3D $\to$ X processes to guide the hard…


Videos

Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image · SlidesLive

Taxonomy

Topics: Image Retrieval and Classification Techniques · Video Analysis and Summarization

Methods: Diffusion