What If We Recaption Billions of Web Images with LLaMA-3?
Xianhang Li, Haoqin Tu, Mude Hui, Zeyu Wang, Bingchen Zhao, Junfei Xiao, Sucheng Ren, Jieru Mei, Qing Liu, Huangjie Zheng, Yuyin Zhou, Cihang Xie

TL;DR
This paper demonstrates that recaptioning a large-scale web image dataset with an open-source LLaMA-3 model significantly improves the performance of vision-language models in tasks like image retrieval and text-to-image generation.
Contribution
It introduces a scalable recaptioning pipeline: a LLaMA-3-8B powered LLaVA-1.5 is fine-tuned and used to recaption the 1.3 billion images of DataComp-1B. The resulting dataset, Recap-DataComp-1B, improves model performance across multiple vision-language tasks.
Findings
Enhanced zero-shot cross-modal retrieval performance.
Improved alignment of generated images with complex text queries.
Significant benefits in training vision-language models with Recap-DataComp-1B.
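Zero-shot cross-modal retrieval is commonly scored with Recall@K: the fraction of queries whose matching item appears among the K most similar candidates. The sketch below is illustrative only and is not taken from the paper. It assumes image and text embeddings are already available as lists of floats, that row i of each list forms a matched pair, and that cosine similarity is the ranking score. The names `cosine` and `recall_at_k` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def recall_at_k(image_embs, text_embs, k=1):
    """Image-to-text Recall@K, assuming image i is paired with text i."""
    hits = 0
    for i, img in enumerate(image_embs):
        # Score every caption against this image.
        scores = [cosine(img, txt) for txt in text_embs]
        # Rank caption indices from most to least similar.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        # Count a hit when the paired caption is within the top k.
        if i in ranked[:k]:
            hits += 1
    return hits / len(image_embs)


if __name__ == "__main__":
    # Toy embeddings (made up for illustration): captions 0 and 1 are swapped.
    images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    texts = [[0.0, 1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    print(recall_at_k(images, texts, k=1))
    print(recall_at_k(images, texts, k=3))
```

Swapping the two arguments gives the text-to-image direction under the same pairing assumption.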
Abstract
Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and open-sourced LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal…
Taxonomy
Topics: Medical Image Segmentation Techniques
Methods: Residual Connection · Softmax · Layer Normalization · Contrastive Language-Image Pre-training · Byte Pair Encoding · Label Smoothing · Diffusion · Adam · Attention Is All You Need · Linear Layer
