Fine-Tuning Stable Diffusion XL for Stylistic Icon Generation: A Comparison of Caption Size
Youssef Sultan, Jiangqin Ma, Yu-Ying Liao

TL;DR
This paper explores fine-tuning methods for Stable Diffusion XL to generate stylistic icons, emphasizes the need for evaluation that goes beyond FID scores, and highlights the limitations of CLIP scores in assessing icon quality.
Contribution
It introduces tailored fine-tuning techniques and critiques existing evaluation metrics for icon generation, proposing more effective approaches for commercial applications.
Findings
FID scores may not reflect icon quality accurately
CLIP scores can misjudge icon similarity and quality
Proper evaluation metrics are crucial for commercial icon generation
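As a rough illustration of the quantities behind the two metrics named above, the sketch below computes a simplified Fréchet distance and a cosine similarity in plain Python. It is not the paper's evaluation code. Standard FID fits Gaussians with full covariance matrices to Inception features; this sketch assumes diagonal covariances to stay dependency-free, and the function names and toy feature vectors are hypothetical. CLIP-based scores are built on the cosine similarity between embeddings, which is all the second function computes.

```python
import math


def diagonal_frechet_distance(real_feats, gen_feats):
    """Frechet distance between Gaussians fitted to two feature sets.

    Assumption: covariances are treated as diagonal, a simplification of
    the full-covariance formula used by standard FID.
    """
    def mean_and_var(rows):
        n, dim = len(rows), len(rows[0])
        means = [sum(r[d] for r in rows) / n for d in range(dim)]
        # Unbiased per-dimension sample variance.
        variances = [
            sum((r[d] - means[d]) ** 2 for r in rows) / (n - 1)
            for d in range(dim)
        ]
        return means, variances

    mu_r, var_r = mean_and_var(real_feats)
    mu_g, var_g = mean_and_var(gen_feats)
    # Squared distance between the two means.
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    # Diagonal case of Tr(S_r + S_g - 2 (S_r S_g)^(1/2)).
    cov_term = sum(a + b - 2 * math.sqrt(a * b) for a, b in zip(var_r, var_g))
    return mean_term + cov_term


def cosine_similarity(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


# Toy stand-ins for image features (hypothetical values, not real embeddings).
real = [[0.0, 0.0], [2.0, 2.0]]
generated = [[1.0, 1.0], [3.0, 3.0]]
print(diagonal_frechet_distance(real, generated))
print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))
```

Both functions reduce a set of images to summary statistics or a single vector, so neither looks directly at properties such as clean edges or stylistic consistency, which is one way to read the findings above.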
Abstract
In this paper, we present different fine-tuning methods for Stable Diffusion XL, including inference steps and caption customization for each image, so that generated images match the style of a commercial 2D icon training set. We also show how important it is to define properly what "high-quality" means, especially in a commercial-use environment. As generative AI models continue to gain widespread acceptance and usage, many different ways to optimize and evaluate them for various applications have emerged. Text-to-image models in particular, such as Stable Diffusion XL and DALL-E 3, require distinct evaluation practices to generate high-quality icons in a specific style. Although some images generated in a certain style may have a lower (better) FID score, we show that this is not conclusive on its own, even for rasterized icons. While FID…
Taxonomy
Topics: Multimodal Machine Learning Applications · Human Motion and Animation · Video Analysis and Summarization
Methods: Contrastive Language-Image Pre-training · ALIGN · Diffusion
