Generative Visual Instruction Tuning
Jefferson Hernandez, Ruben Villegas, Vicente Ordonez

TL;DR
This paper introduces GenLLaVA, a multimodal model trained on a new instruction set that combines GPT-4V-generated data with existing datasets, improving zero-shot visual understanding and adding image generation and editing capabilities.
Contribution
It presents an instruction tuning strategy that integrates multiple pretrained models (Mistral, SigLIP, and StableDiffusion) into a versatile visual assistant that outperforms LLaVA on visual understanding.
Findings
GenLLaVA surpasses LLaVA in visual understanding tasks.
Achieves competitive results with models like Unified-IO 2.
Open-sources dataset, code, and checkpoints for community use.
Abstract
We propose to use automatically generated instruction-following data to improve the zero-shot capabilities of a large multimodal model with additional support for generative and image editing tasks. We achieve this by curating a new multimodal instruction-following set using GPT-4V and existing datasets for image generation and editing. Using this instruction set and the existing LLaVA-Finetune instruction set for visual understanding tasks, we produce GenLLaVA, a Generative Large Language and Visual Assistant. GenLLaVA is built through a strategy that combines three types of large pretrained models through instruction finetuning: Mistral for language modeling, SigLIP for image-text matching, and StableDiffusion for text-to-image generation. Our model demonstrates visual understanding capabilities superior to LLaVA and additionally demonstrates competitive results with native multimodal…
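As a rough illustration of the data-mixing idea in the abstract, the sketch below pools understanding, generation, and editing samples into one shuffled instruction set. The `InstructionSample` record, its field names, the task labels, the file names, and the toy examples are all assumptions made for illustration; they are not the paper's actual data format or pipeline.

```python
import random
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class InstructionSample:
    # Hypothetical record layout, not the paper's actual schema.
    task: str                    # "understanding", "generation", or "editing"
    instruction: str             # user prompt
    input_image: Optional[str]   # path to an input image, if any
    target_text: Optional[str]   # expected text response, if any
    target_image: Optional[str]  # path to the image to produce, if any


def build_mixture(understanding: List[InstructionSample],
                  generation: List[InstructionSample],
                  editing: List[InstructionSample],
                  seed: int = 0) -> List[InstructionSample]:
    """Pool the three task-specific sets and shuffle them reproducibly."""
    mixture = list(understanding) + list(generation) + list(editing)
    random.Random(seed).shuffle(mixture)
    return mixture


# Toy examples (assumed, for illustration only).
understanding = [InstructionSample(
    "understanding", "What is in the image?", "cat.jpg", "A cat on a sofa.", None)]
generation = [InstructionSample(
    "generation", "Draw a red bicycle.", None, None, "bicycle.png")]
editing = [InstructionSample(
    "editing", "Make the sky purple.", "beach.jpg", None, "beach_purple.png")]

mixture = build_mixture(understanding, generation, editing, seed=0)
task_counts = Counter(sample.task for sample in mixture)
print(dict(task_counts))
```

Keeping a `task` label on every sample is one simple way to let a single training run cover all three task types; how GenLLaVA actually routes samples between the language model and the diffusion model is not specified in this summary.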
Peer Reviews
Decision·Submitted to ICLR 2025
Originality:
- New Dataset Curation: The creation of a new multimodal instruction-following dataset that amalgamates image understanding, generation, and editing is innovative. It addresses the need for diverse training data to support complex multimodal tasks.
- Single-Stage Training Strategy: Moving away from the traditional multi-stage training pipeline to a single-stage training recipe is a significant departure that simplifies the training process while maintaining performance.
Quality: -
1. Conciseness of the Methodology Section: The methodology section lacks sufficient depth, particularly in explaining how the model integrates outputs from large language models with Diffuser models for tasks in image generation and editing. Specific details, such as the configuration of attention masks, inputs and target outputs during image generation, and the loss functions employed, are absent. Including these would enhance clarity.
2. Limited Originality in Methodology: Due to its brevity,
1. Developing generative multimodal instruction-following data could be highly valuable for future research and applications.
2. The model shows strong performance across image understanding, image editing, and image generation tasks.
3. The paper is well-presented, making it easy to follow and understand.
1. Although the authors promise to open-source all materials, including data, code, and pre-trained weights, none of these resources are provided for review. Since the dataset is a key contribution of this work, it would strengthen the paper to include these materials in the revised manuscript (with anonymity). I would consider recommending acceptance only if these resources are included in the final version.
2. The novelty of this paper appears limited, as the model seems to be a straightforwar
While the task and the approach are not completely new, the paper proposes several improvements to the current approach to building MLLMs and demonstrates their effectiveness with relatively comprehensive evaluations, resulting in a new model that is strong on both visual understanding and visual generation. The paper is largely clear about its goal, approach, and results, though it would be better if the reasons behind several design choices were explained in more detail.
The biggest weakness is the incremental nature of the work. For example, the paper claims it unifies visual understanding and generation, but there are similar models that are also capable of both, such as SEED-X and the compared work Unified-IO 2. The paper also claims to contribute a dataset for instruction tuning, but the dataset is composed of multiple existing datasets. The novelty to me would be insights on why these sets are selected, and why for some datasets such as IPr2
