Scaling Text-Rich Image Understanding via Code-Guided Synthetic Multimodal Data Generation

Yue Yang; Ajay Patel; Matt Deitke; Tanmay Gupta; Luca Weihs; Andrew Head; Mark Yatskar; Chris Callison-Burch; Ranjay Krishna; Aniruddha Kembhavi; Christopher Clark

arXiv:2502.14846·cs.CV·May 22, 2025

TL;DR

CoSyn is a framework that uses text-only large language models to write rendering code for synthetic text-rich images and then to generate instruction-tuning data from that code, improving vision-language model performance on text-rich benchmarks.

Contribution

This work presents a method for automatic synthetic data generation in which a text-only LLM writes the code that renders each image, improving training data diversity for text-rich image understanding.

Findings

01 Achieved state-of-the-art results on seven benchmarks.

02 Generated 400K images and 2.7M instruction-tuning samples.

03 Enabled grounding capabilities in vision-language models.

Abstract

Reasoning about images with rich text, such as charts and documents, is a critical application of vision-language models (VLMs). However, VLMs often struggle in these domains due to the scarcity of diverse text-rich vision-language data. To address this challenge, we present CoSyn, a framework that leverages the coding capabilities of text-only large language models (LLMs) to automatically create synthetic text-rich multimodal data. Given input text describing a target domain (e.g., "nutrition fact labels"), CoSyn prompts an LLM to generate code (Python, HTML, LaTeX, etc.) for rendering synthetic images. With the underlying code as textual representations of the synthetic images, CoSyn can generate high-quality instruction-tuning data, again relying on a text-only LLM. Using CoSyn, we constructed a dataset comprising 400K images and 2.7M rows of vision-language instruction-tuning data.…
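
The two LLM calls described in the abstract (code generation, then instruction data generated from that code) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `LLM` and `Renderer` types, the prompt wording, and the stub functions are all hypothetical. A real pipeline would call an actual text-only LLM and a real renderer, for example a headless browser for HTML.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical interfaces: any prompt -> completion function, and any
# code -> image-bytes function, can be plugged in here.
LLM = Callable[[str], str]
Renderer = Callable[[str], bytes]


@dataclass
class SyntheticExample:
    domain: str
    code: str      # rendering code; also serves as the text representation of the image
    image: bytes   # rendered image bytes
    qa_text: str   # raw instruction-tuning text produced from the code


def code_prompt(domain: str, language: str) -> str:
    # Prompt wording is an assumption made for illustration.
    return (f"Write {language} code that renders a realistic image of: "
            f"{domain}. Return only the code.")


def qa_prompt(code: str) -> str:
    # The LLM never sees pixels; it reads the code that produced them.
    return ("The following code renders an image. Write question-answer "
            "pairs about the image using only information in the code.\n\n"
            + code)


def synthesize(domain: str, llm: LLM, render: Renderer,
               language: str = "HTML") -> SyntheticExample:
    code = llm(code_prompt(domain, language))   # step 1: LLM writes rendering code
    image = render(code)                        # step 2: execute code to get the image
    qa_text = llm(qa_prompt(code))              # step 3: LLM writes Q&A from the code
    return SyntheticExample(domain, code, image, qa_text)


# Stubs standing in for a real LLM and renderer (hypothetical).
def stub_llm(prompt: str) -> str:
    if "Return only the code" in prompt:
        return "<table><tr><td>Calories</td><td>120</td></tr></table>"
    return "Q: How many calories are listed? A: 120"


def stub_render(code: str) -> bytes:
    return code.encode("utf-8")


example = synthesize("nutrition fact labels", stub_llm, stub_render)
```

The point of the design is that the image and its annotations are both derived from the same code, so a text-only model can write questions and answers about an image it never sees.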

Taxonomy

Topics: Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques · Image Retrieval and Classification Techniques

Methods: LLaMA