Accelerating Pre-training of Multimodal LLMs via Chain-of-Sight
Ziyuan Huang, Kaixiang Ji, Biao Gong, Zhiwu Qing, Qinglong Zhang, Kecheng Zheng, Jian Wang, Jingdong Chen, Ming Yang

TL;DR
This paper presents Chain-of-Sight, a vision-language bridge module that accelerates multimodal large language model pre-training by using fewer visual tokens during pre-training and a multi-scale visual resampling architecture, cutting wall-clock pre-training time by ~73%.
Contribution
It introduces Chain-of-Sight, a new module that effectively accelerates pre-training of MLLMs by reducing visual tokens and leveraging a multi-scale visual context, without performance loss.
Findings
Wall-clock pre-training time is reduced by ~73%.
Performance matches or surpasses the baseline on vision-language benchmarks.
The visual token count can be increased by up to 16x after pre-training.
Abstract
This paper introduces Chain-of-Sight, a vision-language bridge module that accelerates the pre-training of Multimodal Large Language Models (MLLMs). Our approach employs a sequence of visual resamplers that capture visual details at various spatial scales. This architecture not only leverages global and local visual contexts effectively, but also facilitates the flexible extension of visual tokens through a compound token scaling strategy, allowing up to a 16x increase in the token count post pre-training. Consequently, Chain-of-Sight requires significantly fewer visual tokens in the pre-training phase compared to the fine-tuning phase. This intentional reduction of visual tokens during pre-training notably accelerates the pre-training process, cutting down the wall-clock training time by ~73%. Empirical results on a series of vision-language benchmarks reveal that the pre-train…
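The token arithmetic behind a 16x increase can be illustrated with a small sketch. Everything here is an assumption made for illustration, not the paper's configuration: the `Scale` class, the `compound_scale` helper, the two-scale layout, and all window and query counts are hypothetical. The sketch assumes each scale splits the image features into a square grid of windows, that each window is summarized by a fixed number of learnable queries, and that "compound" scaling means growing the window grid and the queries per window together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scale:
    """One resampling scale (hypothetical layout, not taken from the paper)."""
    windows_per_side: int    # image features are split into an N x N grid of windows
    queries_per_window: int  # learnable queries that summarize each window

    @property
    def tokens(self) -> int:
        # Each window contributes one visual token per query.
        return self.windows_per_side ** 2 * self.queries_per_window


def total_tokens(scales):
    """Visual tokens handed to the language model, summed over all scales."""
    return sum(s.tokens for s in scales)


def compound_scale(scales, window_factor, query_factor):
    """Grow the window grid (per side) and the queries per window together."""
    return [
        Scale(s.windows_per_side * window_factor,
              s.queries_per_window * query_factor)
        for s in scales
    ]


# Illustrative numbers only: a global scale plus a 2 x 2 local scale.
pretrain = [Scale(1, 16), Scale(2, 4)]
finetune = compound_scale(pretrain, window_factor=2, query_factor=4)

ratio = total_tokens(finetune) / total_tokens(pretrain)
print(total_tokens(pretrain), total_tokens(finetune), ratio)
```

Under these assumed settings the window count grows 4x (2x per side) and the queries per window grow 4x, so the two factors multiply to the 16x figure. Other factor combinations would give intermediate token budgets for fine-tuning.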
Taxonomy
Topics: Natural Language Processing Techniques · Speech and dialogue systems · Software Testing and Debugging Techniques
