Generative Visual Instruction Tuning

Jefferson Hernandez; Ruben Villegas; Vicente Ordonez

arXiv:2406.11262·cs.CV·October 4, 2024

TL;DR

This paper introduces GenLLaVA, a multimodal model trained with a new instruction set combining GPT-4V generated data and existing datasets, enhancing zero-shot visual understanding, generation, and editing capabilities.

Contribution

It presents a novel instruction tuning strategy that integrates multiple pretrained models to create a versatile visual assistant, outperforming previous models like LLaVA.

Findings

1. GenLLaVA surpasses LLaVA in visual understanding tasks.
2. Achieves competitive results with models like Unified-IO 2.
3. Open-sources dataset, code, and checkpoints for community use.

Abstract

We propose to use automatically generated instruction-following data to improve the zero-shot capabilities of a large multimodal model with additional support for generative and image editing tasks. We achieve this by curating a new multimodal instruction-following set using GPT-4V and existing datasets for image generation and editing. Using this instruction set and the existing LLaVA-Finetune instruction set for visual understanding tasks, we produce GenLLaVA, a Generative Large Language and Visual Assistant. GenLLaVA is built through a strategy that combines three types of large pretrained models through instruction finetuning: Mistral for language modeling, SigLIP for image-text matching, and StableDiffusion for text-to-image generation. Our model demonstrates visual understanding capabilities superior to LLaVA and additionally demonstrates competitive results with native multimodal…
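The abstract describes training on the new generation and editing instruction set together with the existing LLaVA-Finetune set. As a rough illustration only, the snippet below sketches how per-task instruction sets might be pooled into one shuffled training list. The `Example` type, the `build_mixture` helper, the toy records, and the `<image>` placeholder target are all hypothetical and are not taken from the paper or its repository.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One instruction-following record (hypothetical schema)."""
    task: str         # "understanding", "generation", or "editing"
    instruction: str  # user-facing prompt
    target: str       # expected response; "<image>" is a placeholder assumption


def build_mixture(sources, seed=0):
    """Pool every per-task instruction set and shuffle deterministically."""
    mixed = [ex for examples in sources.values() for ex in examples]
    random.Random(seed).shuffle(mixed)
    return mixed


# Toy stand-ins for the three kinds of data named in the abstract.
sources = {
    "understanding": [
        Example("understanding", "What is in the image?", "A dog on a beach."),
    ],
    "generation": [
        Example("generation", "Generate an image of a red bicycle.", "<image>"),
    ],
    "editing": [
        Example("editing", "Make the sky purple.", "<image>"),
    ],
}

mixture = build_mixture(sources, seed=0)
```

A fixed seed keeps the mixture order reproducible across runs; how the authors actually weight or order the tasks is not stated in this excerpt.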

Peer Reviews

Decision·Submitted to ICLR 2025

Reviewer 01 · Rating 3 · Confidence 5

Strengths

Originality:
- New Dataset Curation: The creation of a new multimodal instruction-following dataset that amalgamates image understanding, generation, and editing is innovative. It addresses the need for diverse training data to support complex multimodal tasks.
- Single-Stage Training Strategy: Moving away from the traditional multi-stage training pipeline to a single-stage training recipe is a significant departure that simplifies the training process while maintaining performance.

Quality: -

Weaknesses

1. Conciseness of the Methodology Section: The methodology section lacks sufficient depth, particularly in explaining how the model integrates outputs from large language models with Diffuser models for tasks in image generation and editing. Specific details, such as the configuration of attention masks, inputs and target outputs during image generation, and the loss functions employed, are absent. Including these would enhance clarity.
2. Limited Originality in Methodology: Due to its brevity,

Reviewer 02 · Rating 5 · Confidence 5

Strengths

1. Developing generative multimodal instruction-following data could be highly valuable for future research and applications.
2. The model shows strong performance across image understanding, image editing, and image generation tasks.
3. The paper is well-presented, making it easy to follow and understand.

Weaknesses

1. Although the authors promise to open-source all materials, including data, code, and pre-trained weights, none of these resources are provided for review. Since the dataset is a key contribution of this work, it would strengthen the paper to include these materials in the revised manuscript (with anonymity). I would consider recommending acceptance only if these resources are included in the final version.
2. The novelty of this paper appears limited, as the model seems to be a straightforwar

Reviewer 03 · Rating 6 · Confidence 3

Strengths

While neither the task nor the approach is entirely new, the paper proposes several improvements to the current approach to building MLLMs and demonstrates their effectiveness with relatively comprehensive evaluations, resulting in a new model that is strong at both visual understanding and visual generation. The paper is largely clear about its goal, approach, and results, though it would be better if the reasons behind several design choices were explained in more detail.

Weaknesses

The biggest weakness is the incremental nature of the work. For example, the paper claims it unifies visual understanding and generation, but similar models such as SEED-X are also capable of both, as is compared work such as Unified-IO 2. The paper claims to contribute a dataset for instruction tuning, but the dataset is composed of multiple existing datasets. The novelty to me would be insights on why these sets are selected, and why for some datasets such as IPr2

Code & Models

Repositories

jeffhernandez1995/GenLlaVA · Official

Taxonomy

Topics: Education and Technology Integration · Cognitive and developmental aspects of mathematical skills · Spatial Cognition and Navigation

Methods: Sparse Evolutionary Training · LLaMA