Revisit Large-Scale Image-Caption Data in Pre-training Multimodal Foundation Models

Zhengfeng Lai; Vasileios Saveris; Chen Chen; Hong-You Chen; Haotian Zhang; Bowen Zhang; Juan Lao Tebar; Wenze Hu; Zhe Gan; Peter Grasch; Meng Cao; Yinfei Yang

arXiv:2410.02740·cs.CV·October 4, 2024

TL;DR

This paper investigates the roles and interactions of synthetic captions and AltTexts in pre-training multimodal models, proposing a scalable captioning pipeline that enhances model performance by tailoring caption formats to specific model preferences.

Contribution

The paper introduces a controllable, scalable captioning pipeline and systematically analyzes how different caption formats affect various multimodal models, revealing optimal strategies for pre-training.

Findings

01. Hybrid captioning approaches outperform synthetic-only methods.
02. Different models prefer specific caption formats for optimal performance.
03. Combining AltTexts with synthetic captions improves image-text alignment.

Abstract

Recent advancements in multimodal models highlight the value of rewritten captions for improving performance, yet key challenges remain. For example, while synthetic captions often provide superior quality and image-text alignment, it is not clear whether they can fully replace AltTexts: the role of synthetic captions and their interaction with original web-crawled AltTexts in pre-training is still not well understood. Moreover, different multimodal foundation models may have unique preferences for specific caption formats, but efforts to identify the optimal captions for each model remain limited. In this work, we propose a novel, controllable, and scalable captioning pipeline designed to generate diverse caption formats tailored to various multimodal models. By examining caption formats ranging from Short Synthetic Captions (SSC) to Dense Synthetic Captions (DSC+) as case studies, we systematically explore…

Code & Models

Repositories

apple/ml-veclip
tf

Taxonomy

Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Human Pose and Action Recognition

Methods: Contrastive Language-Image Pre-training · Diffusion