LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents

Shilong Liu; Hao Cheng; Haotian Liu; Hao Zhang; Feng Li; Tianhe Ren; Xueyan Zou; Jianwei Yang; Hang Su; Jun Zhu; Lei Zhang; Jianfeng Gao; Chunyuan Li

arXiv:2311.05437 · cs.CV · November 10, 2023 · 6 cites

TL;DR

LLaVA-Plus is a versatile multimodal assistant that learns to utilize various tools for enhanced visual understanding, generation, and external knowledge retrieval, significantly improving performance and enabling new interaction scenarios.

Contribution

It introduces a general-purpose multimodal model that actively uses a repository of pre-trained tools, expanding capabilities beyond previous models.

Findings

1. Outperforms LLaVA in existing multimodal tasks
2. Demonstrates new capabilities in tool use and visual reasoning
3. Active engagement with image queries improves interaction quality

Abstract

LLaVA-Plus is a general-purpose multimodal assistant that expands the capabilities of large multimodal models. It maintains a skill repository of pre-trained vision and vision-language models and can activate relevant tools based on users' inputs to fulfill real-world tasks. LLaVA-Plus is trained on multimodal instruction-following data to acquire the ability to use tools, covering visual understanding, generation, external knowledge retrieval, and compositions. Empirical results show that LLaVA-Plus outperforms LLaVA in existing capabilities and exhibits new ones. It is distinct in that the image query is directly grounded and actively engaged throughout the entire human-AI interaction sessions, significantly improving tool use performance and enabling new scenarios.
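
The skill-repository idea in the abstract can be pictured with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Skill` and `SkillRepository` classes, the stub tools, and the keyword-based `choose_skill` function are hypothetical and are not taken from the LLaVA-Plus codebase. In the paper the choice of tool is made by the trained multimodal model, not by keyword rules; the sketch only shows the shape of "pick a tool from the repository based on the user's input, run it on the image, and fold the result into the reply".

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Skill:
    """Hypothetical entry in a skill repository (not the paper's data structure)."""
    name: str
    description: str
    run: Callable[[str, str], str]  # (image_ref, user_input) -> text result


class SkillRepository:
    """Keeps registered skills and dispatches calls to them by name."""

    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def register(self, skill: Skill) -> None:
        self._skills[skill.name] = skill

    def names(self) -> List[str]:
        return sorted(self._skills)

    def call(self, name: str, image_ref: str, user_input: str) -> str:
        if name not in self._skills:
            raise KeyError(f"unknown skill: {name}")
        return self._skills[name].run(image_ref, user_input)


def choose_skill(user_input: str) -> str:
    """Keyword stand-in (an assumption) for the trained model's tool choice."""
    text = user_input.lower()
    if "read" in text or "text" in text:
        return "ocr"
    if "find" in text or "detect" in text:
        return "detection"
    return "caption"


def answer(repo: SkillRepository, image_ref: str, user_input: str) -> str:
    """Select a skill, run it on the image, and wrap the result in a reply."""
    name = choose_skill(user_input)
    result = repo.call(name, image_ref, user_input)
    return f"[{name}] {result}"


# Stub tools; a real system would wrap pre-trained vision models here.
repo = SkillRepository()
repo.register(Skill("ocr", "read text in an image",
                    lambda img, arg: f"stub OCR output for {img}"))
repo.register(Skill("detection", "locate objects in an image",
                    lambda img, arg: f"stub boxes for {img}"))
repo.register(Skill("caption", "describe an image",
                    lambda img, arg: f"stub caption for {img}"))

print(answer(repo, "photo.png", "Please read the sign"))
```

Note that the image reference is passed to every tool call, which loosely mirrors the abstract's point that the image query stays engaged throughout the interaction.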

Peer Reviews

Decision · Submitted to ICLR 2024

Reviewer 01 · Rating 3 (reject, not good enough) · Confidence 5

Strengths

The main contributions lie on the data side, while the training algorithm and the model architecture largely follow previous work. The authors create new multimodal instruction-following tool-use data, integrating many real-world tools (skills) such as detectors, OCR, and image generators. The created dataset is useful for training multimodal language models to use pre-selected tools and perform better on downstream tasks.

Weaknesses

1. Lack of novelty. The paper feels more like an industry paper, which has heavy data engineering work to improve performance, instead of a research paper that has novel insights and approaches compared to previous work. Basically, all the design choices in this paper can be anticipated and do not provide too many insights.
2. Lack of flexibility. If I understand correctly, once new tools/skills are added, the models must be retrained on the augmented dataset to master this new tool. Is this c

Reviewer 02 · Rating 6 (marginally above the acceptance threshold) · Confidence 3

Strengths

1. Extensive evaluation: authors evaluate their multimodal tool-based reasoning approach with a large number of tools on existing as well as their own benchmark and compare it with other SOTA LMM approaches. I think their exhaustive evaluation would be useful for the community moving forward.
2. I appreciate the authors' commitment to reproducibility and open-sourcing.
3. The paper is overall well-written and easy to follow.

Weaknesses

1. Limited novelty: Despite the extensive evaluation, the work's novelty is limited, especially from the methodology standpoint, since neither instruction tuning nor the use of multimodal tools for reasoning is a novel method.
2. The open-source instruction-tuning dataset is one of the contributions of this work. However, since the dataset was GPT-generated and not human-evaluated, it is unclear how much hallucination it contains. It would be good to add a limitations section to the paper and discuss this.

Reviewer 03 · Rating 3 (reject, not good enough) · Confidence 4

Strengths

I like the approach using the open-weight LLaVA model, which provides a lot more ground for actual meaningful experimentation on the model itself than closed proprietary models like GPT-4. The authors propose a fairly straightforward augmentation to the LLaVA model that on the surface appears to provide an ability to expand its vocabulary of skills by determining which external models to call on for a given task, using visual-linguistic supervision in an AutoGPT-like approach. The paper is als

Weaknesses

Much of the meaningful technical information, such as substantive results, is limited to the appendix. This makes me question the soundness of the paper and what its contribution actually is. We don't even get to the related work until page 6, so the experiments and results are squeezed in at the very end. The word "planning" is used, but the paper doesn't really have anything to do with planning as commonly understood. There is no goal to be trained to achieve. The order of which API should b

Code & Models

Repositories

LLaVA-VL/LLaVA-Plus-Codebase
PyTorch · Official

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques