Scalable Vision Language Model Training via High Quality Data Curation
Hongyuan Dong, Zijian Kang, Weijie Yin, Xiao Liang, Chao Feng, Jiao Ran

TL;DR
This paper presents SAIL-VL, a scalable vision language model series that achieves state-of-the-art performance through high-quality data curation, extensive pretraining, and effective supervised fine-tuning, setting new benchmarks in visual understanding.
Contribution
Introduction of SAIL-VL, a vision language model series that leverages high-quality data construction, large-scale pretraining, and advanced fine-tuning techniques for superior performance.
Findings
The SAIL-Caption dataset is validated to have the highest data quality among the open-source datasets it is compared with.
The 2B SAIL-VL model achieves top scores on 18 VLM benchmarks.
Scaling data size and complexity improves model performance significantly.
Abstract
In this paper, we introduce SAIL-VL (ScAlable Vision Language Model TraIning via High QuaLity Data Curation), an open-source vision language model (VLM) series achieving state-of-the-art (SOTA) performance at the 2B and 8B parameter scales. The following three key improvements contribute to SAIL-VL's leading performance: (1) Scalable high-quality visual understanding data construction: We implement a data construction pipeline to enable hundred-million-scale high-quality recaption data annotation. The resulting dataset, SAIL-Caption, is validated to be of the highest data quality compared with open-source datasets. (2) Scalable Pretraining with High-Quality Visual Understanding Data: We scale SAIL-VL's pretraining budget up to 655B tokens and show that even a 2B VLM benefits from scaled-up training data sizes, exhibiting logarithmic data size scaling laws in benchmark performance. (3) Scalable SFT…
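The "logarithmic data size scaling" mentioned in the abstract can be read as benchmark score growing roughly linearly in the logarithm of the number of training tokens. The sketch below is only an illustration of that functional form, not the authors' fitting procedure: the function name `fit_log_scaling`, the token budgets, and the scores are all hypothetical, and the data points are synthetic rather than taken from the paper.

```python
import math


def fit_log_scaling(tokens, scores):
    """Ordinary least-squares fit of score = a + b * ln(tokens).

    Hypothetical helper for illustration; not from the SAIL-VL paper.
    """
    xs = [math.log(t) for t in tokens]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx               # slope: score gain per unit of ln(tokens)
    a = mean_y - b * mean_x     # intercept
    return a, b


# SYNTHETIC data (assumption): scores generated from an exact log law,
# chosen only to exercise the fit. These are NOT results from the paper.
tokens = [10e9, 40e9, 160e9, 640e9]
scores = [50.0 + 2.0 * math.log(t / 10e9) for t in tokens]

a, b = fit_log_scaling(tokens, scores)
gain_per_doubling = b * math.log(2)  # expected score gain when data doubles
print(f"slope b = {b:.3f}, gain per doubling = {gain_per_doubling:.3f}")
```

Under such a law, each doubling of the training data adds a constant increment `b * ln(2)` to the score, which is why gains look steady on a log-scaled axis but diminishing on a linear one.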
Code & Models
- 🤗 BytedanceDouyinContent/SAIL-VL-2B · model · 32 dl · ♡ 33
- 🤗 BytedanceDouyinContent/SAIL-VL-8B · model · 5 dl · ♡ 4
- 🤗 BytedanceDouyinContent/SAIL-VL-1d5-2B · model · 35 dl · ♡ 12
- 🤗 BytedanceDouyinContent/SAIL-VL-1d5-8B · model · 5 dl · ♡ 2
- 🤗 BytedanceDouyinContent/SAIL-VL-1d6-8B · model · 36k dl · ♡ 15
- 🤗 BytedanceDouyinContent/SAIL-VL2-2B · model · 1.9k dl · ♡ 11
- 🤗 BytedanceDouyinContent/SAIL-VL2-8B · model · 755 dl · ♡ 13
- 🤗 BytedanceDouyinContent/SAIL-VL2-2B-Thinking · model · 6 dl · ♡ 3
- 🤗 BytedanceDouyinContent/SAIL-VL2-8B-Thinking · model · 11 dl · ♡ 7
Taxonomy
Topics: Semantic Web and Ontologies · Advanced Image and Video Retrieval Techniques · Image Retrieval and Classification Techniques
Methods: Shrink and Fine-Tune
