Scalable Vision Language Model Training via High Quality Data Curation

Hongyuan Dong; Zijian Kang; Weijie Yin; Xiao Liang; Chao Feng; Jiao Ran

arXiv:2501.05952 · cs.CV · June 10, 2025

TL;DR

This paper presents SAIL-VL, an open-source vision language model (VLM) series that reaches state-of-the-art performance at the 2B and 8B parameter scales through scalable high-quality data curation, large-scale pretraining, and supervised fine-tuning.

Contribution

Introduction of SAIL-VL, a vision language model series that leverages high-quality data construction, large-scale pretraining, and advanced fine-tuning techniques for superior performance.

Findings

01. The SAIL-Caption dataset has the highest data quality among the open-source datasets it is compared with.

02. The 2B SAIL-VL model achieves top scores on 18 VLM benchmarks.

03. Scaling data size and complexity significantly improves model performance.

Abstract

In this paper, we introduce SAIL-VL (ScAlable Vision Language Model TraIning via High QuaLity Data Curation), an open-source vision language model (VLM) series achieving state-of-the-art (SOTA) performance at the 2B and 8B parameter scales. Three key improvements contribute to SAIL-VL's leading performance: (1) Scalable high-quality visual understanding data construction: We implement a data construction pipeline that enables high-quality recaption data annotation at the hundred-million scale. The resulting dataset, SAIL-Caption, is validated to have the highest data quality among the open-source datasets compared. (2) Scalable pretraining with high-quality visual understanding data: We scale SAIL-VL's pretraining budget up to 655B tokens and show that even a 2B VLM benefits from scaled-up training data, with benchmark performance following a logarithmic scaling law in data size. (3) Scalable SFT…
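
The logarithmic data-size scaling mentioned in point (2) can be illustrated with a minimal sketch, assuming a curve of the form score = a + b * ln(tokens). The function name, the coefficients, and every data point below are hypothetical and generated synthetically for illustration; only the 655B-token budget comes from the abstract, and none of the numbers are results from the paper.

```python
import math

def fit_log_scaling(tokens, scores):
    """Ordinary least-squares fit of score = a + b * ln(tokens)."""
    xs = [math.log(t) for t in tokens]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx              # slope: score gain per unit of ln(tokens)
    a = mean_y - b * mean_x    # intercept
    return a, b

# Synthetic points drawn from a known curve (assumed coefficients, NOT paper results).
TRUE_A, TRUE_B = 10.0, 2.0
token_budgets = [1e9, 1e10, 1e11, 6.55e11]  # last value mirrors the 655B budget
synthetic_scores = [TRUE_A + TRUE_B * math.log(t) for t in token_budgets]

a, b = fit_log_scaling(token_budgets, synthetic_scores)

# Under a logarithmic law, each doubling of data adds a constant b * ln(2) to the score.
gain_per_doubling = b * math.log(2)
print(f"a={a:.3f}, b={b:.3f}, gain per doubling={gain_per_doubling:.3f}")
```

Under such a law, every doubling of the training data yields the same absolute score gain, so the improvement per added token shrinks as the budget grows.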

Taxonomy

Topics: Semantic Web and Ontologies · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques

Methods: Shrink and Fine-Tune