FontUse: A Data-Centric Approach to Style- and Use-Case-Conditioned In-Image Typography
Xia Xin, Yuki Endo, Yoshihiro Kanamori

TL;DR
This paper introduces FontUse, a data-centric method that improves typography control in text-to-image models. Models are trained on a dataset of about 70K annotated images whose prompts combine font styles and use cases, so generated text follows the requested typographic appearance more closely.
Contribution
The paper presents an automatic annotation pipeline and the dataset it produces, FontUse, which together enable fine-tuning image generation models for improved typography control without architectural changes.
Findings
Models trained with FontUse annotations better match typographic prompts.
The Long-CLIP metric effectively measures typography-prompt alignment.
FontUse improves consistency of generated text with specified styles and use cases.
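A CLIP-style alignment score is commonly computed as the cosine similarity between an image embedding and a prompt embedding. The sketch below shows only that final step, under stated assumptions: the vectors are made-up placeholders, no Long-CLIP model is loaded, and the paper's exact evaluation protocol is not reproduced here.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Placeholder vectors standing in for encoder outputs (assumption:
# a real pipeline would obtain these from image and text encoders).
image_embedding = [0.2, 0.7, 0.1]
prompt_embedding = [0.25, 0.65, 0.05]

score = cosine_similarity(image_embedding, prompt_embedding)
print(f"alignment score: {score:.3f}")
```

A higher score means the generated image sits closer to the typographic prompt in the shared embedding space.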
Abstract
Recent text-to-image models can generate high-quality images from natural-language prompts, yet controlling typography remains challenging: requested typographic appearance is often ignored or only weakly followed. We address this limitation with a data-centric approach that trains image generation models using targeted supervision derived from a structured annotation pipeline specialized for typography. Our pipeline constructs a large-scale typography-focused dataset, FontUse, consisting of about 70K images annotated with user-friendly prompts, text-region locations, and OCR-recognized strings. The annotations are automatically produced using segmentation models and multimodal large language models (MLLMs). The prompts explicitly combine font styles (e.g., serif, script, elegant) and use cases (e.g., wedding invitations, coffee-shop menus), enabling intuitive specification even for…
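As a rough illustration of the annotations described above (a prompt, text-region locations, and OCR-recognized strings), one record could be modeled as follows. This is a minimal sketch: the class names, field names, box convention, file path, and prompt template are all hypothetical and are not taken from the FontUse release.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TextRegion:
    # Hypothetical box convention: (x_min, y_min, x_max, y_max) in pixels.
    box: Tuple[int, int, int, int]
    # String recognized by OCR inside the box.
    text: str


@dataclass
class FontUseRecord:
    # Hypothetical schema for one annotated image.
    image_path: str
    prompt: str
    regions: List[TextRegion] = field(default_factory=list)


def compose_prompt(styles: List[str], use_case: str, text: str) -> str:
    """Join font styles and a use case into one prompt (hypothetical template)."""
    style_part = ", ".join(styles)
    return f'{style_part} typography for {use_case}, with the text "{text}"'


prompt = compose_prompt(["serif", "elegant"], "a wedding invitation", "Save the Date")
record = FontUseRecord(
    image_path="example.png",  # placeholder path
    prompt=prompt,
    regions=[TextRegion(box=(40, 60, 360, 120), text="Save the Date")],
)
print(record.prompt)
```

Keeping style and use case as separate inputs mirrors the abstract's point that prompts combine both, so either can be varied independently.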
Taxonomy
Topics: Generative Adversarial Networks and Image Synthesis · Handwritten Text Recognition Techniques · Multimodal Machine Learning Applications
