VisionFoundry: Teaching VLMs Visual Perception with Synthetic Images

Guanyu Zhou; Yida Yin; Wenhao Chai; Shengbang Tong; Xingyu Fu; Zhuang Liu

arXiv:2604.09531 · cs.CV · April 13, 2026

TL;DR

VisionFoundry introduces a synthetic data pipeline that enhances vision-language models' visual perception by generating targeted training data from task keywords, improving performance on perception benchmarks.

Contribution

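The paper presents VisionFoundry, a task-aware synthetic data generation pipeline in which large language models write the questions, answers, and text-to-image prompts and text-to-image models synthesize the images. The pipeline is used to build VisionFoundry-10K, a synthetic VQA dataset aimed at improving VLMs' visual perception.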

Findings

01. Models trained on VisionFoundry-10K improve on perception benchmarks by up to 10%.

02. Synthetic supervision addresses low-level visual skill limitations in VLMs.

03. The approach scales favorably with increased data size.

Abstract

Vision-language models (VLMs) still struggle with visual perception tasks such as spatial understanding and viewpoint recognition. One plausible contributing factor is that natural image datasets provide limited supervision for low-level visual skills. This motivates a practical question: can targeted synthetic supervision, generated from only a task keyword such as Depth Order, address these weaknesses? To investigate this question, we introduce VisionFoundry, a task-aware synthetic data generation pipeline that takes only the task name as input and uses large language models (LLMs) to generate questions, answers, and text-to-image (T2I) prompts, then synthesizes images with T2I models and verifies consistency with a proprietary VLM, requiring no reference images or human annotation. Using VisionFoundry, we construct VisionFoundry-10K, a synthetic visual question answering (VQA)…
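
The generate-then-verify flow described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Spec`, `VQASample`, and `generate_samples` are hypothetical, and the three stub functions stand in for real LLM, T2I, and VLM calls, which the abstract does not specify.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Spec:
    """One LLM-proposed item: a question, its answer, and a T2I prompt."""
    question: str
    answer: str
    t2i_prompt: str


@dataclass
class VQASample:
    """A verified synthetic VQA example (hypothetical record layout)."""
    task: str
    question: str
    answer: str
    t2i_prompt: str
    image: bytes


def generate_samples(
    task: str,
    propose_specs: Callable[[str, int], List[Spec]],
    synthesize_image: Callable[[str], bytes],
    verify: Callable[[bytes, str, str], bool],
    n: int,
) -> List[VQASample]:
    """Turn a task keyword into verified samples.

    The only task-specific input is the task name. Every backend is
    injected, so no particular LLM, T2I model, or VLM is assumed.
    """
    kept = []
    for spec in propose_specs(task, n):
        image = synthesize_image(spec.t2i_prompt)
        # Keep the sample only if the verifier agrees that the image
        # is consistent with the question and answer.
        if verify(image, spec.question, spec.answer):
            kept.append(
                VQASample(task, spec.question, spec.answer, spec.t2i_prompt, image)
            )
    return kept


# Stub backends for illustration only (assumptions, not real model calls).
def stub_propose(task: str, n: int) -> List[Spec]:
    return [
        Spec(f"{task} question {i}", "A" if i % 2 == 0 else "B", f"prompt {i}")
        for i in range(n)
    ]


def stub_t2i(prompt: str) -> bytes:
    return prompt.encode("utf-8")  # placeholder bytes instead of a real image


def stub_verify(image: bytes, question: str, answer: str) -> bool:
    return answer == "A"  # pretend the verifier rejects every "B" item


samples = generate_samples("Depth Order", stub_propose, stub_t2i, stub_verify, n=4)
```

With these stubs, half of the proposed items fail verification and are dropped, which mirrors the filtering role the abstract assigns to the verifier. Passing the backends in as callables keeps the sketch independent of any specific model API.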

Code & Models

Repositories

zlab-princeton/VisionFoundry
github

Datasets

zlab-princeton/VisionFoundry-10K
dataset · 413 dl
