An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models
Yadong Lu, Chunyuan Li, Haotian Liu, Jianwei Yang, Jianfeng Gao, Yelong Shen

TL;DR
This paper empirically studies how model scale, image resolution, data mixing, and parameter-efficient training methods affect the performance of large multimodal models, and finds that scaling consistently improves both multimodal and language capabilities.
Contribution
It presents an empirical study of scaling LLaVA to 33B and 65B/70B parameters, where most earlier open-source studies stopped at 13B or smaller, and evaluates image resolution, data mixing, and parameter-efficient training (LoRA/QLoRA) for visual instruction tuning.
Findings
Scaling models improves performance and language capabilities.
LoRA/QLoRA tuning performs comparably to full-model fine-tuning.
Higher image resolution and data mixing enhance model effectiveness.
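The LoRA finding above rests on training only a low-rank update to each frozen weight matrix. The sketch below illustrates that general technique in plain Python; it is not the paper's training code. The function names, the 4096-wide layer, and the rank of 64 are illustrative assumptions, not values taken from the paper.

```python
# Minimal LoRA sketch (illustrative; not the paper's implementation).
# A frozen weight W (d_out x d_in) is augmented with a trainable low-rank
# update B @ A, where A is (rank x d_in) and B is (d_out x rank).

def full_params(d_in, d_out):
    """Parameters updated when fine-tuning the full weight matrix."""
    return d_in * d_out

def lora_trainable_params(d_in, d_out, rank):
    """Parameters updated under LoRA: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)

def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def lora_forward(W, A, B, x, alpha, rank):
    """Compute y = W x + (alpha / rank) * B (A x); only A and B are trained."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

# Hypothetical layer size and rank, chosen only to show the arithmetic.
d_in = d_out = 4096
rank = 64
ratio = lora_trainable_params(d_in, d_out, rank) / full_params(d_in, d_out)
print(f"trainable fraction for this layer: {ratio:.5f}")
```

In standard LoRA, B is initialized to zero, so the adapted layer initially reproduces the frozen layer exactly. QLoRA applies the same low-rank update on top of a quantized base model.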
Abstract
Visual instruction tuning has recently shown encouraging progress with open-source large multimodal models (LMMs) such as LLaVA and MiniGPT-4. However, most existing studies of open-source LMMs are performed using models with 13B parameters or fewer. In this paper, we present an empirical study of scaling LLaVA up to 33B and 65B/70B, and share findings from our explorations in image resolution, data mixing, and parameter-efficient training methods such as LoRA/QLoRA. These are evaluated by their impact on multimodal and language capabilities when completing real-world tasks in the wild. We find that scaling LMMs consistently enhances model performance and improves language capabilities, and that the performance of LoRA/QLoRA tuning of LMMs is comparable to that of full-model fine-tuning. Additionally, the study highlights the importance of higher image resolutions and mixing…
Code & Models
- 🤗 saurabh-straive/llava_100k_finetuned (model)
- 🤗 Straive/llava-1.5-13b-lora-100k-8-mar (model)
- 🤗 saurabh-straive/llava-1-5 (model)
- 🤗 GDinesh/llava-1-5 (model)
- 🤗 starriver030515/LLaVA (model)
- 🤗 mylesgoose/Llama-3.1-Minitron-4B-Llava-Nvidia-siglip-ov (model) · ♡ 1
- 🤗 gradguy/model1 (model) · ♡ 1
- 🤗 chouss/llava-spat (model)
- 🤗 zooblastlbz/id-align (model)
- 🤗 YuqianFu/LLaVA (model)
Taxonomy
Topics: Multimodal Machine Learning Applications · Natural Language Processing Techniques · Topic Modeling
