PosterCraft: Rethinking High-Quality Aesthetic Poster Generation in a Unified Framework
SiXiang Chen, Jianyu Lai, Jialin Gao, Tian Ye, Haoyu Chen, Hengyu Shi, Shitong Shao, Yunlong Lin, Song Fei, Zhaohu Xing, Yeying Jin, Junfeng Luo, Xiaoming Wei, Lei Zhu

TL;DR
PosterCraft introduces a unified, end-to-end framework for generating high-quality aesthetic posters by integrating text rendering, artistic content, and layout optimization through multiple training stages and reinforcement learning.
Contribution
It proposes a novel unified framework that replaces modular pipelines with a cascaded, trainable system for aesthetic poster generation, utilizing large-scale datasets and reinforcement learning.
Findings
Outperforms open-source baselines in layout coherence and visual appeal
Achieves quality comparable to state-of-the-art commercial systems
Demonstrates robustness through an automated data construction pipeline
Abstract
Generating aesthetic posters is more challenging than simple design images: it requires not only precise text rendering but also the seamless integration of abstract artistic content, striking layouts, and overall stylistic harmony. To address this, we propose PosterCraft, a unified framework that abandons prior modular pipelines and rigid, predefined layouts, allowing the model to freely explore coherent, visually compelling compositions. PosterCraft employs a carefully designed, cascaded workflow to optimize the generation of high-aesthetic posters: (i) large-scale text-rendering optimization on our newly introduced Text-Render-2M dataset; (ii) region-aware supervised fine-tuning on HQ-Poster100K; (iii) aesthetic-text-reinforcement learning via best-of-n preference optimization; and (iv) joint vision-language feedback refinement. Each stage is supported by a fully automated…
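Stage (iii) of the abstract mentions best-of-n preference optimization. As a rough illustration only, the sketch below shows one common way to turn n sampled candidates into a single (preferred, rejected) pair by scoring them and keeping the extremes. The function name `best_of_n_pair`, the candidate records, and the scoring function are hypothetical; the excerpt does not describe the reward model or the selection rule PosterCraft actually uses.

```python
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def best_of_n_pair(
    candidates: Sequence[T],
    score_fn: Callable[[T], float],
) -> Tuple[T, T]:
    """Return (preferred, rejected): the highest- and lowest-scoring candidates.

    Assumption: a single scalar score per candidate. A real pipeline might
    combine an aesthetic score with a text-accuracy check instead.
    """
    if len(candidates) < 2:
        raise ValueError("need at least two candidates to form a preference pair")
    ranked = sorted(candidates, key=score_fn)  # ascending by score
    return ranked[-1], ranked[0]


# Toy usage: four hypothetical samples for one prompt, with made-up scores.
samples = [
    {"id": "a", "score": 0.41},
    {"id": "b", "score": 0.87},
    {"id": "c", "score": 0.12},
    {"id": "d", "score": 0.65},
]
preferred, rejected = best_of_n_pair(samples, lambda s: s["score"])
```

Pairs built this way could then feed a preference-optimization objective; which objective and which scorers are used is outside what this summary states.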
Peer Reviews
Decision·ICLR 2026 Poster
+ The proposed framework abandons modular and layout-constrained designs, enabling holistic integration of text, layout, and artistic content for visually coherent posters.
+ The work also introduces and leverages large, high-quality, stage-specific datasets, supporting robust and scalable training.
+ The proposed method outperforms some recent methods in text accuracy, aesthetics, and prompt alignment.
- When evaluating text rendering performance, it is usually important to measure text rendering quality and accuracy across different text lengths, from simple to complex, e.g. <20 words, 20-60 words, >60 words. The current work lacks such measurements, making it hard to justify its strength, especially for complex cases.
- Text rendering is an important area that spans multiple domains, including posters, infographics, and scene text. It is not clear why the work is limited only to the poster
1. The paper introduces four large-scale, automated datasets (Text-Render-2M, HQ-Poster-100K, etc.) tailored for specific training stages. This pipeline provides high-quality, specialized data for text rendering, style fine-tuning, and preference learning, addressing a major bottleneck in the field.
2. The paper proposes a unified framework that abandons rigid, modular pipelines where layout and text are generated separately. This approach allows the model to holistically explore coherent combi
1. The claimed “unified generative framework” mainly integrates existing methods rather than introducing a fundamentally new generative modeling concept. Each stage—text rendering optimization, preference learning with DPO, and vision-language feedback—relies heavily on prior work. As a result, the contribution is more engineering-oriented than algorithmically innovative, making the paper better suited to an application or dataset construction area rather than the core generative models track.
In general, the paper looks good to me.
1. It shows a good example of the full building pipeline of a high-quality domain-specific image generation system. Core procedures like data collection/curation, preference alignment, and reflection optimization are not only covered but accomplished at high quality.
2. The proposed datasets are very helpful to this field.
None
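One review above asks for text-rendering accuracy broken down by text length (<20 words, 20-60 words, >60 words). A minimal sketch of such a bucketed evaluation follows. The word-level recall metric and all function names are assumptions chosen for illustration; they are not the metric reported in the paper, and a real evaluation would first need OCR output for each generated poster.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def length_bucket(n_words: int) -> str:
    """Map a word count to the buckets suggested in the review."""
    if n_words < 20:
        return "<20"
    if n_words <= 60:
        return "20-60"
    return ">60"


def word_recall(target: str, rendered: str) -> float:
    """Fraction of target words (case-insensitive, with multiplicity)
    that also appear in the rendered text. A stand-in metric only."""
    want = Counter(target.lower().split())
    got = Counter(rendered.lower().split())
    total = sum(want.values())
    if total == 0:
        return 1.0
    matched = sum(min(count, got[word]) for word, count in want.items())
    return matched / total


def recall_by_bucket(pairs: Iterable[Tuple[str, str]]) -> Dict[str, float]:
    """Average word recall per length bucket over (target, rendered) pairs."""
    sums: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for target, rendered in pairs:
        bucket = length_bucket(len(target.split()))
        sums[bucket] += word_recall(target, rendered)
        counts[bucket] += 1
    return {bucket: sums[bucket] / counts[bucket] for bucket in sums}
```

Reporting one number per bucket makes it visible whether accuracy degrades on long text, which a single aggregate score would hide.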
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Visual Attention and Saliency Detection
