Synergistic Dual Spatial-aware Generation of Image-to-Text and Text-to-Image
Yu Zhao, Hao Fei, Xiangtai Li, Libo Qin, Jiayi Ji, Hongyuan Zhu, Meishan Zhang, Min Zhang, Jianguo Wei

TL;DR
This paper introduces a dual learning framework that uses a novel 3D scene graph and a discrete diffusion process to improve spatial understanding in image-to-text and text-to-image tasks, outperforming mainstream T2I and I2T methods on the VSD dataset.
Contribution
The work proposes a dual learning approach with a shared 3D scene graph and a diffusion model to enhance spatial reasoning in image-text generation tasks.
Findings
Outperforms mainstream T2I and I2T methods on the VSD dataset
Demonstrates the effectiveness of dual learning in spatial understanding
Provides in-depth analysis of the dual learning strategy's advantages
Abstract
In the visual spatial understanding (VSU) area, spatial image-to-text (SI2T) and spatial text-to-image (ST2I) are two fundamental tasks that appear in dual form. Existing methods for standalone SI2T or ST2I perform imperfectly in spatial understanding, due to the difficulty of 3D-wise spatial feature modeling. In this work, we consider modeling the SI2T and ST2I together under a dual learning framework. During the dual framework, we then propose to represent the 3D spatial scene features with a novel 3D scene graph (3DSG) representation that can be shared and beneficial to both tasks. Further, inspired by the intuition that the easier 3D→image and 3D→text processes also exist symmetrically in the ST2I and SI2T, respectively, we propose the Spatial Dual Discrete Diffusion (SD³) framework, which utilizes the intermediate features of the 3D→X processes to guide the hard…
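To make the idea of a shared 3D scene graph more concrete, here is a minimal illustrative sketch of a graph that stores object positions in 3D and spatial-relation triples. It is not the paper's 3DSG implementation: the class name `SceneGraph3D`, the axis convention, the predicate vocabulary, and the dominant-axis rule are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field

Vec3 = tuple[float, float, float]

# Assumed axis convention (hypothetical, not from the paper):
# x grows rightward, y grows upward, z grows away from the camera.
# Each pair is (predicate if subject is smaller, predicate if subject is larger).
_AXIS_PREDICATES = (
    ("left of", "right of"),
    ("below", "above"),
    ("in front of", "behind"),
)


@dataclass
class SceneGraph3D:
    """Toy 3D scene graph: object centers plus (subject, predicate, object) triples."""

    centers: dict[str, Vec3] = field(default_factory=dict)
    triples: list[tuple[str, str, str]] = field(default_factory=list)

    def add_object(self, name: str, center: Vec3) -> None:
        self.centers[name] = center

    def relate(self, subject: str, obj: str) -> str:
        """Derive a coarse relation from the axis with the largest offset."""
        a, b = self.centers[subject], self.centers[obj]
        diffs = [a[i] - b[i] for i in range(3)]
        axis = max(range(3), key=lambda i: abs(diffs[i]))
        smaller, larger = _AXIS_PREDICATES[axis]
        predicate = smaller if diffs[axis] < 0 else larger
        self.triples.append((subject, predicate, obj))
        return predicate


graph = SceneGraph3D()
graph.add_object("cup", (-0.5, 0.0, 1.0))
graph.add_object("laptop", (0.4, 0.0, 1.1))
print(graph.relate("cup", "laptop"))
```

A structure of this kind could in principle serve as a common intermediate for both directions: a caption can be verbalized from the triples, and an image layout can be conditioned on the object centers. How the paper actually encodes and shares its 3DSG is not specified in the text above.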
Taxonomy
Topics: Image Retrieval and Classification Techniques · Video Analysis and Summarization
Methods: Diffusion
