CapsFusion: Rethinking Image-Text Data at Scale
Qiying Yu, Quan Sun, Xiaosong Zhang, Yufeng Cui, Fan Zhang, Yue Cao,, Xinlong Wang, Jingjing Liu

TL;DR
CapsFusion is a novel framework that enhances multimodal pretraining data by refining web-based and synthetic image-text captions, leading to improved model performance, efficiency, and knowledge retention.
Contribution
It introduces CapsFusion, a framework using large language models to improve caption quality, addressing scalability and knowledge loss issues in multimodal training.
Findings
CapsFusion captions outperform existing ones in benchmark scores.
CapsFusion reduces training computation by 11-16 times.
CapsFusion enhances world knowledge and scalability of models.
Abstract
Large multimodal models demonstrate remarkable generalist ability to perform diverse multimodal tasks in a zero-shot manner. Large-scale web-based image-text pairs contribute fundamentally to this success, but suffer from excessive noise. Recent studies use alternative captions synthesized by captioning models and have achieved notable benchmark performance. However, our experiments reveal significant Scalability Deficiency and World Knowledge Loss issues in models trained with synthetic captions, which have been largely obscured by their initial benchmark success. Upon closer examination, we identify the root cause as the overly-simplified language structure and lack of knowledge details in existing synthetic captions. To provide higher-quality and more scalable multimodal pretraining data, we propose CapsFusion, an advanced framework that leverages large language models to consolidate…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsMultimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
