Improving fine-grained understanding in image-text pre-training
Ioana Bica, Anastasija Ilić, Matthias Bauer, Goker Erdogan, Matko Bošnjak, Christos Kaplanis, Alexey A. Gritsenko, Matthias Minderer, Charles Blundell, Razvan Pascanu, Jovana Mitrović

TL;DR
SPARC is a novel pretraining method that enhances fine-grained multimodal representations by learning sparse, token-specific image patch groupings, improving performance on both coarse and fine-grained vision-language tasks.
Contribution
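The paper introduces SPARC, a simple approach to fine-grained contrastive alignment that learns a sparse grouping of image patches for each caption token. Its fine-grained loss depends only on individual samples and needs no other batch samples as negatives; it is combined with a global image-text contrastive loss.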
Findings
Improves performance on coarse-grained, image-level tasks (classification) and on fine-grained tasks (retrieval, detection, segmentation).
Improves model faithfulness and captioning quality.
Achieves better fine-grained understanding than existing methods.
Abstract
We introduce SPARse Fine-grained Contrastive Alignment (SPARC), a simple method for pretraining more fine-grained multimodal representations from image-text pairs. Given that multiple image patches often correspond to single words, we propose to learn a grouping of image patches for every token in the caption. To achieve this, we use a sparse similarity metric between image patches and language tokens and compute for each token a language-grouped vision embedding as the weighted average of patches. The token and language-grouped vision embeddings are then contrasted through a fine-grained sequence-wise loss that only depends on individual samples and does not require other batch samples as negatives. This enables more detailed information to be learned in a computationally inexpensive manner. SPARC combines this fine-grained loss with a contrastive loss between global image and text…
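The fine-grained part of the abstract can be sketched in NumPy for a single image-caption pair. This is a minimal illustration, not the authors' implementation: the abstract only says the similarity is "sparse" and that the grouped embedding is a "weighted average of patches", so the min-max normalisation, the 1/P threshold, the temperature value, and all function names below are assumptions of this sketch. The global image-text contrastive loss that SPARC adds on top is not shown.

```python
import numpy as np


def l2_normalize(x, eps=1e-8):
    """Scale each row to unit length."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def log_softmax(x, axis=-1):
    """Numerically stable log-softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def grouped_vision_embeddings(tokens, patches, eps=1e-8):
    """Return one language-grouped vision embedding per token.

    tokens:  (L, d) token embeddings for one caption
    patches: (P, d) patch embeddings for the paired image
    """
    sim = tokens @ patches.T                      # (L, P) token-patch similarity
    # Assumption: min-max normalise each token's row to [0, 1].
    lo = sim.min(axis=1, keepdims=True)
    hi = sim.max(axis=1, keepdims=True)
    sim = (sim - lo) / (hi - lo + eps)
    # Assumption: sparsify by zeroing entries below 1/P.
    num_patches = patches.shape[0]
    sim = np.where(sim < 1.0 / num_patches, 0.0, sim)
    # Renormalise so each token's weights sum to (approximately) 1.
    weights = sim / (sim.sum(axis=1, keepdims=True) + eps)
    return weights @ patches, weights             # (L, d), (L, P)


def fine_grained_loss(tokens, patches, temperature=0.07):
    """Sequence-wise contrastive loss within one sample.

    Each token is contrasted with its own grouped vision embedding;
    the other tokens of the same caption act as negatives, so no
    other batch samples are needed.
    """
    grouped, _ = grouped_vision_embeddings(tokens, patches)
    logits = l2_normalize(grouped) @ l2_normalize(tokens).T / temperature  # (L, L)
    idx = np.arange(tokens.shape[0])
    loss_g2t = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_t2g = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_g2t + loss_t2g)


# Toy usage with random embeddings: 5 tokens, 49 patches, dimension 16.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 16))
patches = rng.normal(size=(49, 16))
grouped, weights = grouped_vision_embeddings(tokens, patches)
loss = fine_grained_loss(tokens, patches)
```

Because the loss is computed over an L × L matrix per sample rather than over all token-patch pairs in the batch, its cost grows with caption length, not batch size, which is consistent with the abstract's claim that the approach is computationally inexpensive.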
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
