TextPixs: Glyph-Conditioned Diffusion with Character-Aware Attention and OCR-Guided Supervision

Syeda Anshrah Gillani; Mirza Samad Ahmed Baig; Osama Ahmed Khan; Shahid Munir Shah; Umema Mujeeb; Maheen Ali

arXiv:2507.06033 · cs.CV · July 9, 2025

TL;DR

This paper introduces GCDA, a diffusion-based model that generates readable, correctly spelled text in images by combining glyph-aware encoding, character-specific attention, and OCR-guided fine-tuning, achieving state-of-the-art results.

Contribution

The paper proposes a novel glyph-conditioned diffusion framework with character-aware attention and OCR supervision, improving text readability and accuracy in generated images.

Findings

01. Achieves a lower Character Error Rate (0.08) than previous models (0.21).

02. Outperforms previous models on human-perception and text-rendering metrics.

03. Maintains high image quality, with an FID of 14.3.
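
For readers unfamiliar with the metric in the first finding, Character Error Rate is conventionally defined as the character-level edit distance between the text an OCR system reads from the image and the intended text, divided by the length of the intended text. The sketch below follows that standard definition; the function names are illustrative, and the paper's exact evaluation protocol (OCR engine, text normalization, averaging across samples) is not described on this page, so treat it as an assumption rather than the authors' code.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # Keep the shorter string in the inner loop so each row stays small.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / number of characters in the reference.

    `reference` is the text the prompt asked for; `hypothesis` is what
    an OCR system read back from the generated image (assumed workflow).
    """
    if not reference:
        raise ValueError("reference must be non-empty")
    return levenshtein(reference, hypothesis) / len(reference)
```

For example, if the prompt asked for "SALE" and OCR reads "SALF" from the generated image, that is one substitution over four reference characters, a CER of 0.25. Under this definition, a reported CER of 0.08 corresponds to roughly one character error per twelve or thirteen reference characters.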

Abstract

The boom in modern text-to-image diffusion models has opened a new era in digital content production, demonstrating an unprecedented ability to produce photorealistic and stylistically diverse imagery from natural-language descriptions. However, a persistent weakness of these models is that they cannot generate readable, meaningful, and correctly spelled text within generated images, which significantly limits their use in practical applications such as advertising, education, and creative design. This paper introduces a new framework, Glyph-Conditioned Diffusion with Character-Aware Attention (GCDA), which extends a typical diffusion backbone with three purpose-built modules. First, the model has a dual-stream text encoder that encodes both semantic contextual information and explicit glyph representations, resulting in a character-aware…

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Digital Humanities and Scholarship

Methods: Diffusion · Attentive Walk-Aggregating Graph Neural Network