TL;DR
This paper introduces PMT2I, a multilingual prompting method that leverages the multilingual capabilities of large multimodal models to improve text-to-image generation, especially for complex and detailed descriptions.
Contribution
The paper proposes a novel multilingual prompting approach that enhances text comprehension in large multimodal models for improved image generation quality.
Findings
PMT2I outperforms baseline prompts in general, compositional, and fine-grained assessments.
The method achieves higher human preference alignment.
PMT2I generates more diverse images and improves reranking performance.
Abstract
Previous work on augmenting large multimodal models (LMMs) for text-to-image (T2I) generation has focused on enriching the input space of in-context learning (ICL). This includes providing a few demonstrations and optimizing image descriptions to be more detailed and logical. However, as demand for more complex and flexible image descriptions grows, enhancing comprehension of input text within the ICL paradigm remains a critical yet underexplored area. In this work, we extend this line of research by constructing parallel multilingual prompts aimed at harnessing the multilingual capabilities of LMMs. More specifically, we translate the input text into several languages and provide the models with both the original text and the translations. Experiments on two LMMs across three benchmarks show that our method, PMT2I, achieves superior performance in general, compositional, and fine-grained…
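The core step described in the abstract (translate the input text into several languages, then give the model the original together with the translations) can be sketched as below. This is a minimal illustration, not the authors' implementation: the function name `build_parallel_prompt`, the prompt template, the language codes, and the toy lookup-table translator are all assumptions made here for demonstration. In practice the `translate` callable would wrap a real machine translation system.

```python
from typing import Callable, Dict, List


def build_parallel_prompt(
    text: str,
    languages: List[str],
    translate: Callable[[str, str], str],
) -> str:
    """Combine the original text and its translations into one prompt.

    The template (labels and closing instruction) is hypothetical; the
    summary above does not specify the exact prompt format.
    """
    lines = [f"Original: {text}"]
    for lang in languages:
        # One line per translation, labelled with its language code.
        lines.append(f"{lang}: {translate(text, lang)}")
    lines.append("Generate an image that matches the description above.")
    return "\n".join(lines)


# Toy lookup table standing in for a real translation system (assumption).
_TOY_TRANSLATIONS: Dict[str, Dict[str, str]] = {
    "a red cube on a blue sphere": {
        "de": "ein roter Würfel auf einer blauen Kugel",
        "es": "un cubo rojo sobre una esfera azul",
    }
}


def toy_translate(text: str, lang: str) -> str:
    """Look up a canned translation; raises KeyError for unknown inputs."""
    return _TOY_TRANSLATIONS[text][lang]


prompt = build_parallel_prompt(
    "a red cube on a blue sphere", ["de", "es"], toy_translate
)
print(prompt)
```

Passing the translator in as a callable keeps the prompt construction independent of any particular translation service, so the same sketch works with whichever system is available.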
