Text-Printed Image: Bridging the Image-Text Modality Gap for Text-centric Training of Large Vision-Language Models

Shojiro Yamabe; Futa Waseda; Daiki Shiono; Tsubasa Takahashi

arXiv:2512.03463·cs.CV·December 4, 2025

Open Access

TL;DR

This paper introduces Text-Printed Images (TPI), a simple method that generates synthetic images from text descriptions. TPI bridges the image-text modality gap and enables low-cost, text-centric training of large vision-language models (LVLMs) without real images.

Contribution

The paper proposes TPI, a novel, low-cost approach to generate synthetic images from text, improving text-centric training of LVLMs and reducing reliance on costly image datasets.
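
As a rough illustration of what "printing" a description onto an image could look like, here is a minimal sketch using Pillow. It is not the authors' implementation: this listing does not state the rendering details, so the function name `text_printed_image`, the 336x336 white canvas, the default font, the margin, and the 50-character wrap width are all assumptions chosen for the example.

```python
import textwrap

from PIL import Image, ImageDraw, ImageFont


def text_printed_image(text, size=(336, 336), margin=8, chars_per_line=50):
    """Render `text` in black on a white canvas and return a PIL image.

    Hypothetical helper: canvas size, margin, font, and wrap width are
    assumptions for illustration, not values taken from the paper.
    """
    image = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(image)
    font = ImageFont.load_default()  # Pillow's built-in fallback font
    # Break the description into lines so it stays within the canvas width.
    wrapped = textwrap.fill(text, width=chars_per_line)
    draw.multiline_text((margin, margin), wrapped, fill="black", font=font)
    return image


if __name__ == "__main__":
    caption = "A brown dog sits on a red sofa next to a window."
    text_printed_image(caption).save("tpi_example.png")
```

Under these assumptions, the returned image would stand in for a real photo in an image-text training pair, so the text reaches the model through its image pathway rather than as raw text.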

Findings

01. TPI outperforms diffusion-model-generated images in training effectiveness.

02. TPI enhances model performance across multiple benchmarks.

03. TPI serves as an effective data augmentation strategy.

Abstract

Recent large vision-language models (LVLMs) have been applied to diverse VQA tasks. However, achieving practical performance typically requires task-specific fine-tuning with large numbers of image-text pairs, which are costly to collect. In this work, we study text-centric training, a setting where only textual descriptions are available and no real images are provided, as a paradigm for low-cost data scaling. Unlike images, whose collection is often restricted by privacy constraints and scarcity in niche domains, text is widely available. Moreover, text is easily editable, enabling automatic diversification and expansion with LLMs at minimal human effort. While this offers clear advantages over image collection in terms of scalability and cost, training on raw text without images still yields limited gains on VQA tasks because of the image-text modality gap. To address this issue, we…

Taxonomy

Topics: Multimodal Machine Learning Applications · Generative Adversarial Networks and Image Synthesis · Advanced Neural Network Applications