ProVision: Programmatically Scaling Vision-centric Instruction Data for Multimodal Language Models

Jieyu Zhang; Le Xue; Linxin Song; Jun Wang; Weikai Huang; Manli Shu; An Yan; Zixian Ma; Juan Carlos Niebles; Silvio Savarese; Caiming Xiong; Zeyuan Chen; Ranjay Krishna; Ran Xu

arXiv:2412.07012 · cs.CV · December 31, 2024

TL;DR

ProVision introduces a scalable, interpretable, and cost-effective method for generating large-scale vision-centric instruction data using scene graphs and human-written programs, improving multimodal model performance.

Contribution

It presents a novel programmatic approach employing scene graphs and human programs to synthesize diverse, factual vision instruction data, reducing reliance on costly LLMs and MLMs.

Findings

01. Generated over 10 million instruction data points from datasets.
02. Achieved up to 8% performance improvement on multiple benchmarks.
03. Enhanced model performance in both pretraining and instruction tuning stages.

Abstract

With the rise of multimodal applications, instruction data has become critical for training multimodal language models capable of understanding complex image-based queries. Existing practices rely on powerful but costly large language models (LLMs) or multimodal language models (MLMs) to produce instruction data. These models are prone to hallucinations and raise licensing issues, and the generation process is often hard to scale and interpret. In this work, we present a programmatic approach that employs scene graphs as symbolic representations of images and human-written programs to systematically synthesize vision-centric instruction data. Our approach ensures the interpretability and controllability of the data generation process and scales efficiently while maintaining factual accuracy. By implementing a suite of 24 single-image, 14 multi-image instruction generators, and a scene graph…
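
The programmatic idea described in the abstract can be illustrated with a minimal sketch: a scene graph is held as plain data, and small hand-written programs read answers directly from it, so every answer is grounded in an annotation instead of being produced by a model. The `SceneGraph` schema, the two generator functions, and the question templates below are hypothetical and chosen for illustration only; they are not taken from the ProVision codebase.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class SceneGraph:
    """Symbolic description of one image (hypothetical schema)."""
    objects: dict[str, str]  # object id -> category name
    # Each relation is (subject id, predicate, object id).
    relations: list[tuple[str, str, str]] = field(default_factory=list)


def count_questions(graph):
    """Emit one counting question per object category in the graph."""
    counts = Counter(graph.objects.values())
    return [
        (f"How many instances of '{name}' are in the image?", str(n))
        for name, n in sorted(counts.items())
    ]


def relation_questions(graph):
    """Emit one question per annotated relation; the predicate is the answer."""
    pairs = []
    for subj_id, predicate, obj_id in graph.relations:
        subj, obj = graph.objects[subj_id], graph.objects[obj_id]
        pairs.append((f"What is the relation from the {subj} to the {obj}?", predicate))
    return pairs


def generate(graph, generators=(count_questions, relation_questions)):
    """Run every generator over the graph and collect the QA pairs."""
    return [qa for gen in generators for qa in gen(graph)]


# Toy scene graph: two dogs and a frisbee, with one annotated relation.
graph = SceneGraph(
    objects={"o1": "dog", "o2": "dog", "o3": "frisbee"},
    relations=[("o1", "holding", "o3")],
)
qa_pairs = generate(graph)
for question, answer in qa_pairs:
    print(question, "->", answer)
```

Because each generator is an ordinary function over the graph, adding a new question type means adding one function, and any generated answer can be traced back to the annotation that produced it.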

Code & Models

Repositories

jieyuz2/provision (PyTorch, official)

Datasets

Salesforce/ProVision-10M
dataset · 200 dl

Taxonomy

Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Speech and dialogue systems