SceneVTG++: Controllable Multilingual Visual Text Generation in the Wild

Jiawei Liu; Yuanzhi Zhu; Feiyu Gao; Zhibo Yang; Peng Wang; Junyang Lin; Xinggang Wang; Wenyu Liu

arXiv:2501.02962·cs.CV·January 8, 2025

Open Access

TL;DR

SceneVTG++ is a two-stage framework that generates realistic, scene-relevant, controllable multilingual text in natural images, improving OCR training and surpassing previous methods in quality and utility.

Contribution

The paper introduces SceneVTG++, a novel two-stage method combining large language models and diffusion models for controllable, scene-aware multilingual text generation in natural images.

Findings

01. Achieves state-of-the-art text generation quality.

02. Generated images improve OCR training tasks.

03. Effectively controls text attributes like font and color.

Abstract

Generating visual text in natural scene images is a challenging task with many unsolved problems. Unlike text generated on artificially designed images (such as posters, covers, and cartoons), text in natural scene images needs to meet four key criteria: (1) Fidelity: the generated text should appear as realistic as a photograph and be completely accurate, with no errors in any of the strokes. (2) Reasonability: the text should be generated on reasonable carrier areas (such as boards, signs, and walls), and the generated text content should also be relevant to the scene. (3) Utility: the generated text should facilitate the training of natural scene OCR (Optical Character Recognition) tasks. (4) Controllability: the attributes of the text (such as font and color) should be controllable as needed. In this paper, we propose a two-stage method, SceneVTG++,…

Taxonomy

Topics: Multimodal Machine Learning Applications · Video Analysis and Summarization · Human Motion and Animation

Methods: Diffusion