PointT2I: LLM-based text-to-image generation via keypoints
Taekyung Lee, Donggyu Lee, Myungjoo Kang

TL;DR
PointT2I introduces a novel LLM-based framework for text-to-image generation that accurately captures the human poses described in prompts. An LLM generates pose keypoints, and the resulting images are refined through a feedback loop, without requiring fine-tuning.
Contribution
This paper presents the first LLM-based approach for keypoint-guided image generation directly from text prompts, without external references or fine-tuning.
Findings
Accurately generates human pose-aligned images from prompts.
Uses an LLM to generate keypoints and to assess semantic consistency.
Achieves pose accuracy without external data or fine-tuning.
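The Findings describe a loop: the LLM produces keypoints, an image is generated from them, and the LLM then checks semantic consistency. The sketch below shows only that control flow. It assumes the judge's critique is fed back into the next round of keypoint generation, because this summary does not say exactly what the feedback refines. `generate_with_feedback`, `Judgement`, `max_rounds`, and the three callables are hypothetical stand-ins for the LLM, the pose-conditioned image model, and the consistency checker.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class Judgement:
    """Hypothetical result of the LLM consistency check."""
    consistent: bool  # True if the image matches the pose in the prompt
    feedback: str     # critique passed to the next round when inconsistent


def generate_with_feedback(
    prompt: str,
    gen_keypoints: Callable[[str, Optional[str]], dict],  # LLM: prompt (+ feedback) -> keypoints
    gen_image: Callable[[str, dict], Any],                # T2I model conditioned on prompt + keypoints
    judge: Callable[[str, Any], Judgement],               # checks the image against the prompt
    max_rounds: int = 3,                                   # assumed cap on feedback rounds
) -> Tuple[Any, int]:
    """Return (image, rounds_used); stop at the first image judged consistent."""
    feedback: Optional[str] = None
    image: Any = None
    for round_idx in range(1, max_rounds + 1):
        keypoints = gen_keypoints(prompt, feedback)
        image = gen_image(prompt, keypoints)
        verdict = judge(prompt, image)
        if verdict.consistent:
            return image, round_idx
        # Assumption: the critique steers the next keypoint generation.
        feedback = verdict.feedback
    # No round passed the check: fall back to the last image produced.
    return image, max_rounds
```

Capping the number of rounds keeps the cost bounded when the judge never accepts an image. In that case this sketch simply returns the last image it produced.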
Abstract
Text-to-image (T2I) generation models have made significant advances and now produce high-quality images aligned with an input prompt. However, despite their ability to generate fine-grained images, T2I models still struggle to generate accurate images when the input prompt contains complex concepts, especially human poses. In this paper, we propose PointT2I, a framework that uses a large language model (LLM) to generate images that accurately correspond to the human pose described in the prompt. PointT2I consists of three components: keypoint generation, image generation, and a feedback system. The keypoint generation component uses an LLM to directly generate keypoints corresponding to a human pose, based solely on the input prompt and without external references. Subsequently, the image generation component produces images based on both the text prompt and the generated keypoints…
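The keypoint generation step can be sketched as three parts: prompting the LLM for coordinates, validating its reply, and turning the reply into limb segments that a pose-conditioned generator could draw. This is a minimal sketch under stated assumptions. The 17 COCO body keypoints, a 512x512 canvas with a top-left origin, a JSON-only reply, and the simplified limb list `SKELETON` are illustrative choices that the abstract does not specify. `build_keypoint_prompt`, `parse_keypoints`, and `limb_segments` are hypothetical names.

```python
import json

# Assumed keypoint vocabulary (COCO-17); the abstract does not name a format.
COCO_KEYPOINTS = [
    "nose", "left_eye", "right_eye", "left_ear", "right_ear",
    "left_shoulder", "right_shoulder", "left_elbow", "right_elbow",
    "left_wrist", "right_wrist", "left_hip", "right_hip",
    "left_knee", "right_knee", "left_ankle", "right_ankle",
]

# Hypothetical, simplified limb connectivity between named keypoints.
SKELETON = [
    ("left_shoulder", "right_shoulder"), ("left_hip", "right_hip"),
    ("left_shoulder", "left_hip"), ("right_shoulder", "right_hip"),
    ("left_shoulder", "left_elbow"), ("left_elbow", "left_wrist"),
    ("right_shoulder", "right_elbow"), ("right_elbow", "right_wrist"),
    ("left_hip", "left_knee"), ("left_knee", "left_ankle"),
    ("right_hip", "right_knee"), ("right_knee", "right_ankle"),
    ("nose", "left_eye"), ("nose", "right_eye"),
    ("left_eye", "left_ear"), ("right_eye", "right_ear"),
]


def build_keypoint_prompt(scene_prompt: str, width: int = 512, height: int = 512) -> str:
    """Ask an LLM for one (x, y) pixel coordinate per keypoint, using only the text prompt."""
    template = {name: [0, 0] for name in COCO_KEYPOINTS}
    return (
        f"Canvas: {width}x{height} pixels, origin at the top-left corner.\n"
        f"Scene: {scene_prompt}\n"
        "Return the (x, y) pixel coordinates of every body keypoint of the person "
        "so that the pose matches the scene. Reply with JSON only, using exactly "
        f"these keys: {json.dumps(template)}"
    )


def parse_keypoints(llm_text: str, width: int = 512, height: int = 512) -> dict:
    """Extract the JSON object from an LLM reply and validate every keypoint."""
    start, end = llm_text.find("{"), llm_text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in LLM reply")
    data = json.loads(llm_text[start:end + 1])
    pose = {}
    for name in COCO_KEYPOINTS:
        if name not in data:
            raise ValueError(f"missing keypoint: {name}")
        x, y = (float(v) for v in data[name])
        # Clamp so every keypoint lies on the canvas.
        pose[name] = (min(max(x, 0.0), width - 1.0), min(max(y, 0.0), height - 1.0))
    return pose


def limb_segments(pose: dict) -> list:
    """Line segments ((x1, y1), (x2, y2)) that a pose-map renderer would draw."""
    return [(pose[a], pose[b]) for a, b in SKELETON]
```

Clamping out-of-range coordinates, rather than rejecting them, is a design choice that keeps a nearly correct LLM answer usable. A stricter pipeline could instead send the error back to the LLM as feedback.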
