What Matters in Training a GPT4-Style Language Model with Multimodal Inputs?
Yan Zeng, Hanbo Zhang, Jiani Zheng, Jiangnan Xia, Guoqiang Wei, Yang Wei, Yuchen Zhang, Tao Kong

TL;DR
This paper systematically studies how design choices (network structures, training data, and training strategies) affect the training of GPT4-style multimodal models, and introduces Lynx, which shows stronger multi-modal understanding and generation than existing open-source GPT4-style models.
Contribution
It presents a systematic and comprehensive study of multimodal training strategies, covering over 20 model variants under controlled settings, and introduces Lynx, a new model with improved multi-modal understanding and generation.
Findings
Different network structures significantly affect performance.
Data and sampling strategies influence instruction-following ability.
Lynx outperforms existing open-source GPT4-style models in accuracy.
Abstract
Recent advancements in Large Language Models (LLMs) such as GPT4 have displayed exceptional multi-modal capabilities in following open-ended instructions given images. However, the performance of these models heavily relies on design choices such as network structures, training data, and training strategies, and these choices have not been extensively discussed in the literature, making it difficult to quantify progress in this field. To address this issue, this paper presents a systematic and comprehensive study, quantitatively and qualitatively, on training such models. We implement over 20 variants with controlled settings. Concretely, for network structures, we compare different LLM backbones and model designs. For training data, we investigate the impact of data and sampling strategies. For instructions, we explore the influence of diversified prompts on the instruction-following…
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Domain Adaptation and Few-Shot Learning
