Visual Text Generation in the Wild

Yuanzhi Zhu; Jiawei Liu; Feiyu Gao; Wenyu Liu; Xinggang Wang; Peng Wang; Fei Huang; Cong Yao; and Zhibo Yang

arXiv:2407.14138·cs.CV·November 5, 2024

TL;DR

This paper introduces SceneVTG, a visual text generator that combines a multimodal large language model with a diffusion model to produce high-quality, scene-coherent text images in the wild that are also useful for downstream tasks such as text detection and recognition.

Contribution

The paper presents SceneVTG, a two-stage framework integrating multimodal large language models and diffusion models for high-quality, scene-aware text image generation in real-world scenarios.
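
The two-stage structure described above can be sketched as a small pipeline skeleton. This is only an illustration under stated assumptions, not the authors' implementation: the split into a "propose" stage (where and what to write) and a "render" stage (drawing the text), and every name below (TextProposal, generate_scene_text, dummy_propose, dummy_render), are hypothetical and not taken from the paper or its repository. The stand-in stages use plain Python objects so the sketch runs without any model weights.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# (x0, y0, x1, y1) in pixels; a hypothetical convention for this sketch.
Box = Tuple[int, int, int, int]


@dataclass
class TextProposal:
    box: Box       # region of the scene where text should appear
    content: str   # the string to write in that region


# Hypothetical stage signatures; names are illustrative only.
ProposeFn = Callable[[object], List[TextProposal]]
RenderFn = Callable[[object, TextProposal], object]


def generate_scene_text(background, propose: ProposeFn, render: RenderFn):
    """Run stage 1 (regions and contents), then stage 2 (rendering) per proposal."""
    proposals = propose(background)
    image = background
    for proposal in proposals:
        image = render(image, proposal)
    return image, proposals


# Stand-in for the language-model stage: returns one fixed proposal.
def dummy_propose(background) -> List[TextProposal]:
    return [TextProposal(box=(10, 10, 110, 40), content="OPEN")]


# Stand-in for the diffusion stage: a real renderer would draw text inside
# the box; here the "image" is a list that only records each request.
def dummy_render(image, proposal: TextProposal):
    return image + [(proposal.box, proposal.content)]


image, proposals = generate_scene_text([], dummy_propose, dummy_render)
```

Keeping the two stages behind separate callables means either one could be replaced or evaluated in isolation, which is one plausible reason to structure a generator this way.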

Findings

01 Outperforms existing rendering-based and diffusion-based methods in fidelity and reasonability.

02 Generates images that improve text detection and recognition tasks.

03 Demonstrates superior utility of generated images in practical applications.

Abstract

Recently, with the rapid advancements of generative models, the field of visual text generation has witnessed significant progress. However, it is still challenging to render high-quality text images in real-world scenarios, as three critical criteria should be satisfied: (1) Fidelity: the generated text images should be photo-realistic and the contents are expected to be the same as specified in the given conditions; (2) Reasonability: the regions and contents of the generated text should cohere with the scene; (3) Utility: the generated text images can facilitate related tasks (e.g., text detection and recognition). Upon investigation, we find that existing methods, either rendering-based or diffusion-based, can hardly meet all these aspects simultaneously, limiting their application range. Therefore, we propose in this paper a visual text generator (termed SceneVTG), which can…

Code & Models

Repositories

alibabaresearch/advancedliteratemachinery (PyTorch, official)

Taxonomy

Topics: Digital Storytelling and Education

Methods: Diffusion