Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models
Zhengfeng Lai, Vasileios Saveris, Chen Chen, Hong-You Chen, Haotian Zhang, Bowen Zhang, Juan Lao Tebar, Wenze Hu, Zhe Gan, Peter Grasch, Meng Cao, Yinfei Yang

TL;DR
This paper investigates the roles and interactions of synthetic captions and AltTexts in pre-training multimodal models, proposing a scalable captioning pipeline that enhances model performance by tailoring caption formats to specific model preferences.
Contribution
It introduces a controllable, scalable captioning pipeline and systematically analyzes the effects of different caption formats on various multimodal models, revealing optimal strategies for pre-training.
Findings
Hybrid captioning approaches outperform synthetic-only methods.
Different models prefer specific caption formats for optimal performance.
Combining AltTexts with synthetic captions improves image-text alignment.
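As a rough illustration of the hybrid idea in the findings above, the sketch below picks either the AltText or the synthetic caption for each training sample. It is a minimal sketch under stated assumptions, not the paper's pipeline: the function name `pick_caption`, the `synthetic_ratio` parameter, and its default of 0.5 are hypothetical, and the summary here does not state what mixing ratio or mixing scheme the authors use.

```python
import random


def pick_caption(alt_text, synthetic_caption, synthetic_ratio=0.5, rng=random):
    """Choose one caption for a training sample.

    Hypothetical helper: `synthetic_ratio` is the assumed probability of
    using the synthetic caption when both captions are available.
    """
    # Fall back to whichever caption exists if the other is missing or empty.
    if not alt_text:
        return synthetic_caption
    if not synthetic_caption:
        return alt_text
    # Both available: sample the source according to the mixing ratio.
    return synthetic_caption if rng.random() < synthetic_ratio else alt_text


if __name__ == "__main__":
    rng = random.Random(0)  # seeded for repeatable sampling
    sample = {
        "alt_text": "IMG_2041 golden retriever park",
        "synthetic": "A golden retriever runs across a grassy park.",
    }
    for _ in range(3):
        print(pick_caption(sample["alt_text"], sample["synthetic"], 0.5, rng))
```

Setting `synthetic_ratio` to 1.0 gives a synthetic-only setup and 0.0 an AltText-only setup, so the same helper can be used to compare those baselines against a mixed one.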
Abstract
Recent advancements in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. For example, while synthetic captions often provide superior quality and image-text alignment, it is not clear whether they can fully replace AltTexts: the role of synthetic captions and their interaction with original web-crawled AltTexts in pre-training is still not well understood. Moreover, different multimodal foundation models may have unique preferences for specific caption formats, but efforts to identify the optimal captions for each model remain limited. In this work, we propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models. By examining Short Synthetic Captions (SSC) towards Dense Synthetic Captions (DSC+) as case studies, we systematically explore…
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition
Methods: Contrastive Language-Image Pre-training · Diffusion
