Improving fine-grained understanding in image-text pre-training

Ioana Bica; Anastasija Ilić; Matthias Bauer; Goker Erdogan; Matko Bošnjak; Christos Kaplanis; Alexey A. Gritsenko; Matthias Minderer; Charles Blundell; Razvan Pascanu; Jovana Mitrović

arXiv:2401.09865·cs.CV·January 19, 2024·1 cites

Open Access

TL;DR

SPARC is a novel pretraining method that enhances fine-grained multimodal representations by learning sparse, token-specific image patch groupings, improving performance on both coarse and fine-grained vision-language tasks.

Contribution

The paper introduces SPARC, a simple approach to fine-grained contrastive alignment that learns sparse, token-specific image patch groupings. Its fine-grained loss depends only on individual samples and does not require other batch samples as negatives.

Findings

01

Improves performance on image classification, retrieval, detection, and segmentation tasks.

02

Enhances model faithfulness and captioning quality.

03

Achieves better fine-grained understanding compared to existing methods.

Abstract

We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs. Given that multiple image patches often correspond to single words, we propose to learn a grouping of image patches for every token in the caption. To achieve this, we use a sparse similarity metric between image patches and language tokens and compute for each token a language-grouped vision embedding as the weighted average of patches. The token and language-grouped vision embeddings are then contrasted through a fine-grained sequence-wise loss that only depends on individual samples and does not require other batch samples as negatives. This enables more detailed information to be learned in a computationally inexpensive manner. SPARC combines this fine-grained loss with a contrastive loss between global image and text…
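The grouping and fine-grained loss described in the abstract can be sketched roughly as follows. This is a minimal NumPy illustration for a single image-caption pair, not the authors' implementation. The abstract does not specify the details, so the min-max normalisation, the 1/P sparsity threshold, the temperature value, the symmetric form of the loss, and all function names here are assumptions made for the example.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors to unit length along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def language_grouped_vision(tokens, patches, eps=1e-8):
    """Compute one grouped vision embedding per caption token.

    tokens:  (L, d) token embeddings of one caption
    patches: (P, d) patch embeddings of the paired image
    Returns (grouped, weights) with shapes (L, d) and (L, P).
    """
    sim = tokens @ patches.T  # (L, P) token-to-patch similarities
    # ASSUMPTION: min-max normalise each token's similarities over patches.
    lo = sim.min(axis=1, keepdims=True)
    hi = sim.max(axis=1, keepdims=True)
    sim = (sim - lo) / (hi - lo + eps)
    # ASSUMPTION: sparsify by zeroing similarities below 1/P.
    num_patches = patches.shape[0]
    sim = np.where(sim >= 1.0 / num_patches, sim, 0.0)
    # Turn the surviving similarities into averaging weights.
    weights = sim / (sim.sum(axis=1, keepdims=True) + eps)
    return weights @ patches, weights

def fine_grained_loss(tokens, patches, temperature=0.1):
    """Sequence-wise contrastive loss for one sample.

    Each token is contrasted against the grouped vision embeddings of the
    other tokens in the same caption, so no other batch samples are needed.
    ASSUMPTION: symmetric cross-entropy with a fixed temperature.
    """
    grouped, _ = language_grouped_vision(tokens, patches)
    logits = (l2_normalize(grouped) @ l2_normalize(tokens).T) / temperature

    def diagonal_cross_entropy(z):
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))

# Toy usage with random embeddings (5 tokens, 9 patches, 16 dimensions).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 16))
patches = rng.normal(size=(9, 16))
grouped, weights = language_grouped_vision(tokens, patches)
loss = fine_grained_loss(tokens, patches)
print(weights.shape, grouped.shape, float(loss))
```

In this sketch the negatives for a token are the grouped embeddings of the other tokens in the same caption, which is why the loss is computed per sample. Per the abstract, the full method combines this fine-grained loss with a contrastive loss between global image and text embeddings, which is not shown here.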

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling