What If We Recaption Billions of Web Images with LLaMA-3?

Xianhang Li; Haoqin Tu; Mude Hui; Zeyu Wang; Bingchen Zhao; Junfei Xiao; Sucheng Ren; Jieru Mei; Qing Liu; Huangjie Zheng; Yuyin Zhou; Cihang Xie

arXiv:2406.08478·cs.CV·June 19, 2024·2 cites


TL;DR

This paper demonstrates that recaptioning a large-scale web image dataset with an open-source LLaMA-3 model significantly improves the performance of vision-language models in tasks like image retrieval and text-to-image generation.

Contribution

It introduces a scalable recaptioning pipeline using LLaMA-3 to enhance a billion-image dataset, leading to improved model performance across multiple vision-language tasks.

Findings

01 Enhanced zero-shot cross-modal retrieval performance.

02 Improved alignment of generated images with complex text queries.

03 Significant benefits in training vision-language models with Recap-DataComp-1B.

Abstract

Web-crawled image-text pairs are inherently noisy. Prior studies demonstrate that semantically aligning and enriching textual descriptions of these pairs can significantly enhance model training across various vision-language tasks, particularly text-to-image generation. However, large-scale investigations in this area remain predominantly closed-source. Our paper aims to bridge this community effort, leveraging the powerful and open-sourced LLaMA-3, a GPT-4 level LLM. Our recaptioning pipeline is simple: first, we fine-tune a LLaMA-3-8B powered LLaVA-1.5 and then employ it to recaption 1.3 billion images from the DataComp-1B dataset. Our empirical results confirm that this enhanced dataset, Recap-DataComp-1B, offers substantial benefits in training advanced vision-language models. For discriminative models like CLIP, we observe enhanced zero-shot performance in cross-modal…
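The zero-shot cross-modal retrieval gains mentioned above are commonly reported as Recall@K. The sketch below is illustrative only and is not taken from the paper: it assumes that query `i` (for example, a caption embedding) matches gallery item `i` (its image embedding), and it ranks by cosine similarity. The function names `l2_normalize` and `recall_at_k` and the toy embeddings are hypothetical.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def recall_at_k(query_embs, gallery_embs, k=1):
    """Fraction of queries whose paired gallery item (same index) is in the top k.

    Assumption: query i and gallery item i form the ground-truth pair.
    Ranking uses cosine similarity between L2-normalized embeddings.
    """
    if not query_embs:
        return 0.0
    queries = [l2_normalize(q) for q in query_embs]
    gallery = [l2_normalize(g) for g in gallery_embs]
    hits = 0
    for i, q in enumerate(queries):
        sims = [sum(a * b for a, b in zip(q, g)) for g in gallery]
        # Gallery indices sorted from most to least similar to this query.
        ranked = sorted(range(len(gallery)), key=lambda j: sims[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return hits / len(queries)


# Toy 2-D embeddings (made up for illustration, not model outputs).
image_embs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_embs = [[0.9, 0.1], [0.2, 0.8], [0.1, 0.9]]

r1 = recall_at_k(text_embs, image_embs, k=1)  # text-to-image retrieval
r2 = recall_at_k(text_embs, image_embs, k=2)
```

In this toy setup the third caption is closest to the second image, so it only counts as a hit once `k` is at least 2. Swapping the two arguments gives the image-to-text direction.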


Code & Models

Models

UCSC-VLAA/ViT-L-16-HTxt-Recap-CLIP
model · 78k downloads · 17 likes

Datasets


Taxonomy

Topics: Medical Image Segmentation Techniques

Methods: Residual Connection · Softmax · Layer Normalization · Contrastive Language-Image Pre-training · Byte Pair Encoding · Label Smoothing · Diffusion · Adam · Attention Is All You Need · Linear Layer