Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data

Shuhao Gu; Jialing Zhang; Siyuan Zhou; Kevin Yu; Zhaohu Xing; Liangdong Wang; Zhou Cao; Jintao Jia; Zhuoyi Zhang; Yixuan Wang; Zhenchong Hu; Bo-Wen Zhang; Jijie Li; Dong Liang; Yingli Zhao; Songjing Wang; Yulong Ao; Yiming Ju; Huanhuan Ma; Xiaotong Li; Haiwen Diao; Yufeng Cui; Xinlong Wang; Yaoqi Liu; Fangxiang Feng; Guang Liu

arXiv:2410.18558·cs.CL·January 7, 2025

Open Access · 4 Repos · 3 Models · 1 Dataset

TL;DR

Infinity-MM introduces a large-scale, high-quality multimodal instruction dataset of over 40 million samples, together with a synthetic data generation method. The data is used to train Aquila-VL-2B, a 2-billion-parameter VLM that achieves state-of-the-art results on multimodal tasks.

Contribution

The paper presents a new extensive multimodal instruction dataset and a synthetic data generation approach, significantly improving VLM performance and scalability.

Findings

01 Aquila-VL-2B, trained on Infinity-MM, achieved SOTA performance on multimodal tasks.

02 Collected and uniformly preprocessed existing open-source datasets into over 40 million high-quality instruction samples.

03 Demonstrated effective large-scale synthesis of instruction data.

Abstract

Recently, Vision-Language Models (VLMs) have achieved remarkable progress in multimodal tasks, and multimodal instruction data serves as the foundation for enhancing VLM capabilities. Despite the availability of several open-source multimodal datasets, limitations in the scale and quality of open-source instruction data hinder the performance of VLMs trained on these datasets, leading to a significant gap compared to models trained on closed-source data. To address this challenge, we introduce Infinity-MM, a large-scale multimodal instruction dataset. We collected the available multimodal instruction datasets and performed unified preprocessing, resulting in a dataset with over 40 million samples that ensures diversity and accuracy. Furthermore, to enable large-scale expansion of instruction data and support the continuous acquisition of high-quality data, we propose a synthetic…
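The "unified preprocessing" step mentioned in the abstract can be pictured with a minimal sketch. This is not the authors' pipeline: the field names (`conversations`, `question`, `answer`, `image`), the target schema, and the helper names `to_unified`, `fingerprint`, and `merge` are all hypothetical, and the filtering rules (dropping empty turns and exact duplicates) are assumptions chosen only to illustrate the idea of mapping heterogeneous sources onto one format.

```python
import hashlib
import json


def to_unified(record, source):
    """Map a source-specific record onto one shared schema (hypothetical field names)."""
    if "conversations" in record:
        # Multi-turn style: a list of {"from": ..., "value": ...} turns.
        turns = [
            {"role": t["from"], "text": t["value"].strip()}
            for t in record["conversations"]
        ]
    else:
        # Single-turn QA style: one question and one answer.
        turns = [
            {"role": "human", "text": record["question"].strip()},
            {"role": "gpt", "text": record["answer"].strip()},
        ]
    return {"source": source, "image": record.get("image"), "turns": turns}


def fingerprint(sample):
    """Hash the image reference and turns so exact duplicates across sources collide."""
    payload = json.dumps(
        {"image": sample["image"], "turns": sample["turns"]}, sort_keys=True
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def merge(datasets):
    """Unify every source, drop samples with an empty turn, then drop exact duplicates."""
    seen = set()
    merged = []
    for source, records in datasets.items():
        for record in records:
            sample = to_unified(record, source)
            if not all(turn["text"] for turn in sample["turns"]):
                continue  # assumed quality filter: skip samples with empty text
            key = fingerprint(sample)
            if key in seen:
                continue  # assumed dedup rule: keep the first occurrence only
            seen.add(key)
            merged.append(sample)
    return merged
```

Calling `merge` with a dictionary that maps a source name to a list of raw records returns a single list in the shared schema, with each surviving sample tagged by the source it first appeared in.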

Code & Models

Datasets

BAAI/Infinity-MM
dataset · 24k downloads

Taxonomy

Topics: Video Analysis and Summarization