MM1.5: Methods, Analysis & Insights from Multimodal LLM Fine-tuning
Haotian Zhang, Mingfei Gao, Zhe Gan, Philipp Dufter, Nina Wenzel, Forrest Huang, Dhruti Shah, Xianzhi Du, Bowen Zhang, Yanghao Li, Sam Dodge, Keen You, Zhen Yang, Aleksei Timofeev, Mingze Xu, Hong-You Chen, Jean-Philippe Fauconnier, Zhengfeng Lai, Haoxuan You, Zirui Wang

TL;DR
MM1.5 introduces a family of multimodal large language models with a data-centric training approach, achieving strong performance across various tasks and scales, and providing detailed insights into training strategies.
Contribution
The paper presents MM1.5, a new MLLM family with diverse data strategies, specialized variants, and extensive empirical analysis for improved multimodal understanding.
Findings
Effective data curation enhances model performance.
Small-scale models achieve competitive results.
Specialized variants improve video and UI understanding.
Abstract
We present MM1.5, a new family of multimodal large language models (MLLMs) designed to enhance capabilities in text-rich image understanding, visual referring and grounding, and multi-image reasoning. Building upon the MM1 architecture, MM1.5 adopts a data-centric approach to model training, systematically exploring the impact of diverse data mixtures across the entire model training lifecycle. This includes high-quality OCR data and synthetic captions for continual pre-training, as well as an optimized visual instruction-tuning data mixture for supervised fine-tuning. Our models range from 1B to 30B parameters, encompassing both dense and mixture-of-experts (MoE) variants, and demonstrate that careful data curation and training strategies can yield strong performance even at small scales (1B and 3B). Additionally, we introduce two specialized variants: MM1.5-Video, designed for video…
Peer Reviews
Decision: ICLR 2025 Poster
Clarity: The paper is well written and easy to follow. Significance: Multimodal model development is currently a trending topic, and this work employs a data-centric analysis that offers valuable insights for future researchers. Beyond what most commonly used models offer, the examined model can comprehend referred objects and process multiple images.
A drawback of this study is that it serves as a comprehensive ablation analysis without offering any novel methodologies or tasks/benchmarks. I have no suggestions for improving this aspect at this point, but it should not be a problem.
1. This paper conducts thorough experiments to explore the design of data, training, and architecture. 2. This paper provides several pieces of empirical guidance for training MLLMs. The data mixture ratio during the pre-training stage (50:10:40 for image-text, interleaved image-text, and text-only data) could be particularly important, as it highlights the significance of text understanding in complex multimodal scenarios. 3. This paper is clearly and concisely written, making each concept easy to unders
This paper currently resembles a technical report more than academic research. It employs several well-known techniques (e.g. MoE, Dynamic Res) and mainly investigates the data mixture ratio at each training stage. While the empirical findings are valuable, additional theoretical insights would be beneficial. I have two concerns below: - Why does the optimal ratio of text-only data change between the pre-training and SFT stages? Does the role of text-only data differ across these stages? - Giv
1. This paper conducts extensive empirical studies and ablations on continual pre-training, dynamic high-resolution image processing, and curation of supervised fine-tuning datasets, which can provide insights and experience for future research on large-scale MLLMs. 2. This paper presents a set of MLLMs, not only scaling from 1B to 30B but also exploring MoE variants (1B and 3B), which can provide important scaling insights for future research on large-scale MLLMs. 3. This paper shows strong
1. Although this paper presents detailed empirical ablations on several important engineering decisions in data mixture and high-resolution image processing, it does not provide strong new insights or findings compared to previous works. The ablations on SFT data mixture, continual pre-training, and dynamic high-resolution image processing are not novel and have been explored in previous works. 2. Specifically, data-specific hyperparameters from detailed ablations for data mixture (pre-training,
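As an illustration of the 50:10:40 pre-training mixture (image-text, interleaved image-text, text-only) cited in the reviews above, the following is a minimal sketch of ratio-weighted source sampling. It is not the authors' data pipeline: the source names, the function names `sample_sources` and `empirical_ratio`, and the use of per-example sampling are all assumptions made for illustration.

```python
import random
from collections import Counter

# Ratio quoted in the review: image-text : interleaved image-text : text-only.
# The dictionary keys are hypothetical labels, not names from the paper.
MIXTURE = {"image_text": 50, "interleaved": 10, "text_only": 40}


def sample_sources(n, mixture=MIXTURE, seed=0):
    """Draw n source labels with probability proportional to the mixture weights."""
    rng = random.Random(seed)
    names = list(mixture)
    weights = [mixture[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


def empirical_ratio(draws, mixture=MIXTURE):
    """Return the observed fraction of draws that came from each source."""
    counts = Counter(draws)
    total = len(draws)
    return {name: counts[name] / total for name in mixture}


if __name__ == "__main__":
    draws = sample_sources(10_000)
    for name, fraction in empirical_ratio(draws).items():
        print(f"{name}: {fraction:.3f}")
```

With enough draws, the observed fractions approach 0.5, 0.1, and 0.4. A real training setup might instead weight by tokens or batches rather than by individual examples; that choice is outside what the page states.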
Taxonomy
Topics: Advanced Computational Techniques and Applications
