Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion

Jacob Hansen; Wei Lin; Junmo Kang; Muhammad Jehanzeb Mirza; Hongyin Luo; Rogerio Feris; Alan Ritter; James Glass; Leonid Karlinsky

arXiv:2505.18115·cs.CV·May 26, 2025

Instructify: Demystifying Metadata to Visual Instruction Tuning Data Conversion

Jacob Hansen, Wei Lin, Junmo Kang, Muhammad Jehanzeb Mirza, Hongyin Luo, Rogerio Feris, Alan Ritter, James Glass, Leonid Karlinsky

PDF

TL;DR

This paper introduces Instructify, an open, unified method for converting image metadata into visual instruction tuning data using open LLMs, improving data quality and scalability without relying on costly proprietary APIs.

Contribution

Instructify provides a reproducible, open-source framework for metadata-to-VisIT data conversion, enhancing data quality and enabling scalable visual instruction tuning with open models.

Findings

01

Improves GPT-4 instruction quality by ~3% on average

02

Achieves up to 12% improvement on individual benchmarks with open models

03

Enables scalable data generation for niche domains

Abstract

Visual Instruction Tuning (VisIT) data, commonly available as human-assistant conversations with images interleaved in the human turns, are currently the most widespread vehicle for aligning strong LLMs to understand visual inputs, converting them to strong LMMs. While many VisIT datasets are available, most are constructed using ad-hoc techniques developed independently by different groups. They are often poorly documented, lack reproducible code, and rely on paid, closed-source model APIs such as GPT-4, Gemini, or Claude to convert image metadata (labels) into VisIT instructions. This leads to high costs and makes it challenging to scale, enhance quality, or generate VisIT data for new datasets. In this work, we address these challenges and propose an open and unified recipe and approach,~\textbf{\method}, for converting available metadata to VisIT instructions using open LLMs. Our…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.