LLaVA-Plus: Learning to Use Tools for Creating Multimodal Agents
Shilong Liu, Hao Cheng, Haotian Liu, Hao Zhang, Feng Li, Tianhe Ren, Xueyan Zou, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang, Jianfeng Gao, Chunyuan Li

TL;DR
LLaVA-Plus is a versatile multimodal assistant that learns to utilize various tools for enhanced visual understanding, generation, and external knowledge retrieval, significantly improving performance and enabling new interaction scenarios.
Contribution
It introduces a general-purpose multimodal model that actively uses a repository of pre-trained tools, expanding capabilities beyond previous models.
Findings
Outperforms LLaVA in existing multimodal tasks
Demonstrates new capabilities in tool use and visual reasoning
Active engagement with image queries improves interaction quality
Abstract
LLaVA-Plus is a general-purpose multimodal assistant that expands the capabilities of large multimodal models. It maintains a skill repository of pre-trained vision and vision-language models and can activate relevant tools based on users' inputs to fulfill real-world tasks. LLaVA-Plus is trained on multimodal instruction-following data to acquire the ability to use tools, covering visual understanding, generation, external knowledge retrieval, and compositions. Empirical results show that LLaVA-Plus outperforms LLaVA in existing capabilities and exhibits new ones. It is distinct in that the image query is directly grounded and actively engaged throughout the entire human-AI interaction sessions, significantly improving tool use performance and enabling new scenarios.
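The skill-repository idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the names `SkillRepository`, `select_tools`, and `answer` are hypothetical, the tools are stubs that return strings, and the keyword rule merely stands in for the trained model's learned decision about which tool to activate.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A tool takes an image reference and the user query and returns text.
Tool = Callable[[str, str], str]


@dataclass
class SkillRepository:
    """Hypothetical registry mapping tool names to callables."""
    skills: Dict[str, Tool] = field(default_factory=dict)

    def register(self, name: str, fn: Tool) -> None:
        self.skills[name] = fn

    def call(self, name: str, image: str, query: str) -> str:
        if name not in self.skills:
            raise KeyError(f"unknown tool: {name}")
        return self.skills[name](image, query)


def stub_detector(image: str, query: str) -> str:
    # Stand-in for a real pre-trained detector.
    return f"boxes for '{query}' in {image}"


def stub_ocr(image: str, query: str) -> str:
    # Stand-in for a real OCR model.
    return f"text read from {image}"


def select_tools(query: str, repo: SkillRepository) -> List[str]:
    """Assumed keyword rule; in LLaVA-Plus the trained model makes this choice."""
    rules = {"detect": "detection", "read": "ocr"}
    lowered = query.lower()
    return [tool for keyword, tool in rules.items()
            if keyword in lowered and tool in repo.skills]


def answer(image: str, query: str, repo: SkillRepository) -> str:
    """Pick tools, run them on the same image, and aggregate their outputs."""
    outputs = [repo.call(name, image, query)
               for name in select_tools(query, repo)]
    if not outputs:
        return "answered without tools"
    return "; ".join(outputs)


if __name__ == "__main__":
    repo = SkillRepository()
    repo.register("detection", stub_detector)
    repo.register("ocr", stub_ocr)
    print(answer("img.png", "Read the sign", repo))
```

Note that the same image reference is passed to every tool call, loosely mirroring the abstract's point that the image query stays engaged throughout the session. Adding a tool in this sketch is a single `register` call; in the actual system the model would also have to learn when to invoke it.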
Peer Reviews
Decision·Submitted to ICLR 2024
The significant contributions mostly lie in the data perspective, while the training algorithm and the model architecture basically follow previous work. The authors create new multimodal instruction-following tool-use data, integrating many real-world tools (skills), such as detectors, OCR, and image generators. The created dataset is useful for training multimodal language models to use pre-selected tools and perform better on downstream tasks.
1. Lack of novelty. The paper feels more like an industry paper, which has heavy data engineering work to improve performance, instead of a research paper that has novel insights and approaches compared to previous work. Basically, all the design choices in this paper can be anticipated and do not provide too many insights. 2. Lack of flexibility. If I understand correctly, once new tools/skills are added, the models must be retrained on the augmented dataset to master this new tool. Is this c
1. Extensive evaluation: authors evaluate their multimodal tool-based reasoning approach with a large number of tools on existing as well as their own benchmark and compare it with other SOTA LMM approaches. I think their exhaustive evaluation would be useful for the community moving forward. 2. I appreciate the authors' commitment to reproducibility and open-sourcing. 3. The paper is overall well-written and easy to follow.
1. Limited novelty: Despite extensive evaluation, the work's novelty is limited, especially from the methodology standpoint, since neither instruction tuning nor the use of multimodal tools for reasoning is a novel method. 2. The open-source instruction-tuning dataset is one of the contributions of this work. However, since the dataset was GPT-generated and not human-evaluated, it is unclear how much hallucination it contains. It would be good to add a limitations section to the paper and discuss this.
I like the approach using the open-weight LLaVA model, which provides a lot more ground for actual meaningful experimentation on the model itself than closed proprietary models like GPT-4. The authors propose a fairly straightforward augmentation to the LLaVA model that on the surface appears to provide an ability to expand its vocabulary of skills by determining which external models to call on for a given task, using visual-linguistic supervision in an AutoGPT-like approach. The paper is als
Much of the meaningful technical information, such as substantive results, is limited to the appendix. This makes me question the soundness and what the contribution of the paper actually is. We don't even get to the related work until page 6 and so the experiment and results are squeezed at the very end. The word "planning" is used but the paper doesn't really have anything to do with planning as commonly understood. There is no goal to be trained to achieve. The order of which API should b
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
