OCR-VQGAN: Taming Text-within-Image Generation

Juan A. Rodriguez; David Vazquez; Issam Laradji; Marco Pedersoli; Pau Rodriguez

arXiv:2210.11248 · cs.CV · October 26, 2022 · 1 citation

TL;DR

OCR-VQGAN is a novel image generation model that incorporates OCR features to produce high-fidelity figures with readable text, addressing a previously underexplored area in diagram and figure synthesis.

Contribution

The paper introduces OCR-VQGAN, an image encoder and decoder trained with a text perceptual loss computed from OCR pre-trained features, and Paper2Fig100k, a dataset of over 100k figures and texts from research papers.

Findings

01. Effective preservation of text in generated figures.

02. Improved figure reconstruction quality.

03. Impact of perceptual loss weighting on results.

Abstract

Synthetic image generation has recently experienced significant improvements in domains such as natural image or art generation. However, the problem of figure and diagram generation remains unexplored. A challenging aspect of generating figures and diagrams is effectively rendering readable text within the images. To alleviate this problem, we present OCR-VQGAN, an image encoder and decoder that leverage OCR pre-trained features to optimize a text perceptual loss, encouraging the architecture to preserve high-fidelity text and diagram structure. To explore our approach, we introduce the Paper2Fig100k dataset, with over 100k images of figures and texts from research papers. The figures show architecture diagrams and methodologies of articles available at arXiv.org from fields like artificial intelligence and computer vision. Figures usually include text and discrete objects, e.g.,…
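The text perceptual loss mentioned in the abstract can be illustrated with a minimal sketch. It compares features of the input figure and its reconstruction and adds the weighted result to a pixel reconstruction term. The abstract does not specify the OCR network, the layers used, the normalization, or the weighting, so everything below is an assumption: `toy_ocr_features` is a hypothetical stand-in for a frozen OCR pre-trained feature extractor, and `ocr_weight` is an illustrative parameter, not a value from the paper.

```python
from typing import Callable, List

Image = List[List[float]]          # grayscale image as rows of pixel values
Features = List[List[float]]       # one flattened activation vector per layer


def unit_normalize(vec: List[float]) -> List[float]:
    """Scale a feature vector to unit L2 norm (epsilon avoids division by zero)."""
    norm = sum(x * x for x in vec) ** 0.5
    return [x / (norm + 1e-10) for x in vec]


def text_perceptual_loss(feats_real: Features, feats_fake: Features) -> float:
    """Average, over layers, of the mean squared difference between
    unit-normalized feature vectors of the real and reconstructed image."""
    if len(feats_real) != len(feats_fake):
        raise ValueError("both inputs must have the same number of layers")
    total = 0.0
    for real, fake in zip(feats_real, feats_fake):
        a, b = unit_normalize(real), unit_normalize(fake)
        total += sum((x - y) ** 2 for x, y in zip(a, b)) / max(len(a), 1)
    return total / max(len(feats_real), 1)


def toy_ocr_features(image: Image) -> Features:
    """Hypothetical stand-in for a frozen OCR feature extractor.
    Layer 1: raw pixels. Layer 2: horizontal differences (stroke edges)."""
    flat = [p for row in image for p in row]
    edges = [row[i + 1] - row[i] for row in image for i in range(len(row) - 1)]
    return [flat, edges]


def total_loss(image: Image, recon: Image,
               extractor: Callable[[Image], Features],
               ocr_weight: float = 1.0) -> float:
    """Pixel L1 reconstruction loss plus the weighted text perceptual term."""
    pixels = [(p, q) for r1, r2 in zip(image, recon) for p, q in zip(r1, r2)]
    pixel_l1 = sum(abs(p - q) for p, q in pixels) / len(pixels)
    perceptual = text_perceptual_loss(extractor(image), extractor(recon))
    return pixel_l1 + ocr_weight * perceptual


image = [[0.0, 1.0], [1.0, 0.0]]
recon = [[0.0, 0.5], [1.0, 0.0]]   # reconstruction with one blurred stroke pixel
loss_low = total_loss(image, recon, toy_ocr_features, ocr_weight=1.0)
loss_high = total_loss(image, recon, toy_ocr_features, ocr_weight=2.0)
```

In this sketch, raising `ocr_weight` increases the penalty on feature mismatches relative to pixel error, which is the kind of trade-off the third finding above refers to. In a real setup the extractor would be a pretrained OCR network kept frozen during training.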

Code & Models

Videos

OCR-VQGAN: Taming Text-within-Image Generation · YouTube

Taxonomy

Topics: Generative Adversarial Networks and Image Synthesis · Multimodal Machine Learning Applications · Human Motion and Animation