ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models
Jieyu Zhang, Le Xue, Linxin Song, Jun Wang, Weikai Huang, Manli Shu, An Yan, Zixian Ma, Juan Carlos Niebles, Silvio Savarese, Caiming Xiong, Zeyuan Chen, Ranjay Krishna, Ran Xu

TL;DR
ProVision introduces a scalable, interpretable, and cost-effective method for generating large-scale vision-centric instruction data using scene graphs and human-written programs, improving multimodal model performance.
Contribution
The paper presents a novel programmatic approach that uses scene graphs and human-written programs to synthesize diverse, factual vision instruction data, reducing reliance on costly LLMs and MLMs.
Findings
Generated over 10 million instruction data points from existing image datasets.
Achieved up to 8% performance improvement on multiple benchmarks.
Enhanced model performance in both pretraining and instruction tuning stages.
Abstract
With the rise of multimodal applications, instruction data has become critical for training multimodal language models capable of understanding complex image-based queries. Existing practices rely on powerful but costly large language models (LLMs) or multimodal language models (MLMs) to produce instruction data. These models are often prone to hallucinations and licensing issues, and the generation process is often hard to scale and interpret. In this work, we present a programmatic approach that employs scene graphs as symbolic representations of images and human-written programs to systematically synthesize vision-centric instruction data. Our approach ensures the interpretability and controllability of the data generation process and scales efficiently while maintaining factual accuracy. By implementing a suite of 24 single-image, 14 multi-image instruction generators, and a scene graph…
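The idea of pairing a scene graph with human-written programs can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the `SceneGraph` schema, the three generator functions, and the question templates are hypothetical and are not taken from the ProVision codebase. The sketch only shows why such a pipeline is deterministic and interpretable: every answer is read directly from the symbolic annotations, so no language model is involved in producing it.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Toy symbolic representation of one image (hypothetical schema)."""
    objects: dict                                    # object id -> label
    attributes: dict = field(default_factory=dict)   # object id -> list of attributes
    relations: list = field(default_factory=list)    # (subject id, predicate, object id)


def attribute_questions(graph):
    """One question-answer pair per annotated (object, attribute)."""
    pairs = []
    for obj_id, attrs in graph.attributes.items():
        label = graph.objects[obj_id]
        for attr in attrs:
            pairs.append({
                "question": f"What is one attribute of the {label}?",
                "answer": attr,
            })
    return pairs


def relation_questions(graph):
    """One question-answer pair per annotated relation triple."""
    pairs = []
    for subj_id, predicate, obj_id in graph.relations:
        subj, obj = graph.objects[subj_id], graph.objects[obj_id]
        pairs.append({
            "question": f"What is the relation between the {subj} and the {obj}?",
            "answer": predicate,
        })
    return pairs


def count_questions(graph):
    """One counting question per distinct object label."""
    counts = Counter(graph.objects.values())
    return [
        {
            "question": f"How many objects labeled '{label}' are in the image?",
            "answer": str(n),
        }
        for label, n in counts.items()
    ]


def generate_instructions(graph):
    """Run every generator on the graph and concatenate the results."""
    generators = (attribute_questions, relation_questions, count_questions)
    return [pair for gen in generators for pair in gen(graph)]


# Example: two dogs, one red frisbee, one "chasing" relation.
example = SceneGraph(
    objects={"o1": "dog", "o2": "dog", "o3": "frisbee"},
    attributes={"o3": ["red"]},
    relations=[("o1", "chasing", "o3")],
)
for pair in generate_instructions(example):
    print(pair["question"], "->", pair["answer"])
```

Because each generator is an ordinary function over the graph, adding a new question type means writing one more function, and any wrong answer can be traced to a specific annotation or template.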
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Speech and dialogue systems
