SG-Adapter: Enhancing Text-to-Image Generation with Scene Graph Guidance
Guibao Shen, Luozhou Wang, Jiantao Lin, Wenhang Ge, Chaozhe Zhang, Xin Tao, Yuan Zhang, Pengfei Wan, Zhongyuan Wang, Guangyong Chen, Yijun Li, Ying-Cong Chen

TL;DR
This paper introduces SG-Adapter, a method that uses scene graph guidance to improve the accuracy and structural control of text-to-image generation, especially in complex multi-object scenarios.
Contribution
We propose SG-Adapter, which leverages scene graphs to enhance text embeddings for better image correspondence, and we curate MultiRels, a high-quality dataset for training and evaluation.
Findings
Improved alignment between generated images and scene graphs.
Enhanced structural control in complex multi-object scenarios.
Validated effectiveness through qualitative and quantitative metrics.
Abstract
Recent advancements in text-to-image generation have been propelled by the development of diffusion models and multi-modality learning. However, since text is typically represented sequentially in these models, it often falls short in providing accurate contextualization and structural control. As a result, the generated images do not consistently align with human expectations, especially in complex scenarios involving multiple objects and relationships. In this paper, we introduce the Scene Graph Adapter (SG-Adapter), which leverages the structured representation of scene graphs to rectify inaccuracies in the original text embeddings. The SG-Adapter's explicit and non-fully connected graph representation greatly improves on the fully connected, transformer-based text representations. This enhancement is particularly notable in maintaining precise correspondence in scenarios involving multiple…
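The contrast the abstract draws between a fully connected text representation and a non-fully connected graph representation can be illustrated with a small sketch. This is not the authors' implementation: the function names `triplet_mask` and `masked_self_attention`, the convention of tagging each token with a triplet id (with `-1` for tokens outside any triplet), and the use of plain NumPy self-attention without learned projections are all assumptions made for illustration.

```python
import numpy as np

def triplet_mask(triplet_ids):
    """Boolean mask: token i may attend to token j only if both belong
    to the same scene-graph triplet (hypothetical convention: -1 means
    the token belongs to no triplet and attends only to itself)."""
    ids = np.asarray(triplet_ids)
    same = ids[:, None] == ids[None, :]
    unassigned = ids == -1
    same[unassigned, :] = False
    same[:, unassigned] = False
    np.fill_diagonal(same, True)  # every token can always see itself
    return same

def masked_self_attention(x, mask):
    """Scaled dot-product self-attention over token embeddings x,
    restricted to the pairs allowed by mask (no learned projections)."""
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)
    scores = np.where(mask, scores, -np.inf)      # block cross-triplet pairs
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x, weights

# Toy prompt with two relations: (man, riding, horse), (woman, holding, dog)
tokens = ["man", "riding", "horse", "woman", "holding", "dog"]
triplet_ids = [0, 0, 0, 1, 1, 1]
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(tokens), 8))  # stand-in for text embeddings
refined, attn = masked_self_attention(embeddings, triplet_mask(triplet_ids))
```

In this toy setup, "man" receives zero attention weight from "woman", "holding", and "dog", so the refined embedding of each token mixes information only from its own relation. A fully connected mask (all `True`) would recover ordinary self-attention, in which tokens from different relations can blend.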
Peer Reviews
Decision·Submitted to ICLR 2025
1. The method is simple and straightforward in solving the identified problem. 2. The empirical improvement is significant.
1. Some figures are not clear. 2. Some confusing notations.
a. The proposed SG-Adapter is effective in correcting the incorrect contextualization in text embeddings and enhancing the structural semantics generation capabilities of current text-to-image models. b. Both qualitative and quantitative experiments demonstrate that SG-Adapter outperforms the compared SOTA methods.
a. The format of citations in this paper needs to be rectified. b. This method requires the construction of a high-quality dataset, and it is mainly effective for the relationships within the constructed dataset, which will limit its generalization ability. c. The quantitative results are insufficient, and the authors should provide more results to demonstrate its effectiveness.
- The motivation is straightforward. The description of the problem and method is simple, clear and understandable. - Entity interaction information is introduced through the scene graph to optimize the generation of entity interactions, which is relatively direct. - Judging from the demo and the numerical comparisons, the results are good.
- The novelty in method design is limited. It only performs an attention on the text embedding based on the scene graph relationship. It can even be considered as a replacement of the keys and values in the cross-attention structure. Simplicity is not a disadvantage. If it is simple, direct and works, the authors can provide some discussion to analyze the most important factors behind it. Is the adapter structure important or is the scene graph information the most important? For example, i
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Digital Humanities and Scholarship
Methods: ALIGN · Diffusion
