Investigating the Scaling Effect of Instruction Templates for Training Multimodal Language Model
Shijian Wang, Linxin Song, Jieyu Zhang, Ryotaro Shimizu, Jiarui Jin, Ao Luo, Yuan Lu, Li Yao, Cunjian Chen, Julian McAuley, Wentao Zhang, Hanqian Wu

TL;DR
This paper explores how the number of instruction templates affects multimodal language model training, finding that a medium template scale yields the best performance and that template-augmented data gives significant gains over the original data.
Contribution
It introduces a programmatic template generator to systematically study the scaling effect of instruction templates on multimodal language model (MLM) training performance.
Findings
Optimal performance at medium template scale
Up to 10% performance improvement with data augmentation
Best results compared to similar-scale models trained on larger data
Abstract
Current multimodal language model (MLM) training approaches overlook the influence of instruction templates. Previous research deals with this problem by leveraging hand-crafted or model-generated templates, failing to investigate the scaling effect of instruction templates on MLM training. In this work, we propose a programmatic instruction template generator capable of producing over 15K unique instruction templates by filling randomly sampled positional synonyms into weighted sampled meta templates, enabling us to comprehensively explore MLM's performance across various template scales in the training process. Our investigation into scaling instruction templates for MLM training demonstrates that MLM capabilities do not consistently improve with increasing template scale. Instead, optimal performance is achieved at a medium template scale. Models trained with data augmented at the…
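The generation step described in the abstract (weighted sampling of a meta template, then filling each position with a randomly sampled synonym) can be sketched as follows. This is a minimal illustration, not the authors' generator: the meta templates, weights, synonym pools, and function names are all hypothetical, and the real system is reported to produce over 15K unique templates rather than the handful shown here.

```python
import random
from string import Formatter

# Hypothetical meta templates paired with illustrative sampling weights.
META_TEMPLATES = [
    ("{ask} the question about the {image}: {question}", 3.0),
    ("Given the {image}, {question}", 1.0),
]

# Hypothetical synonym pools, one pool per placeholder position.
SYNONYMS = {
    "ask": ["Answer", "Respond to"],
    "image": ["image", "picture", "photo"],
}


def sample_template(rng):
    """Draw one instruction template; {question} stays as a slot for the data sample."""
    patterns = [pattern for pattern, _ in META_TEMPLATES]
    weights = [weight for _, weight in META_TEMPLATES]
    # Weighted sampling of a meta template.
    pattern = rng.choices(patterns, weights=weights, k=1)[0]
    # Uniform random sampling of one synonym per position.
    filled = {slot: rng.choice(words) for slot, words in SYNONYMS.items()}
    return pattern.format(question="{question}", **filled)


def template_space_size():
    """Count the distinct templates this toy generator can produce."""
    total = 0
    for pattern, _ in META_TEMPLATES:
        slots = {name for _, name, _, _ in Formatter().parse(pattern) if name}
        count = 1
        for slot in slots & SYNONYMS.keys():
            count *= len(SYNONYMS[slot])
        total += count
    return total


rng = random.Random(0)
template = sample_template(rng)
print(template.format(question="What is shown?"))
print(template_space_size())
```

Under these assumptions, the number of distinct templates grows multiplicatively with the synonym pool sizes, which is what makes it practical to vary the template scale across training runs.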
Taxonomy
Topics: Second Language Acquisition and Learning · Speech and dialogue systems · Natural Language Processing Techniques
