Enhancing Fine-Grained Vision-Language Pretraining with Negative Augmented Samples

Yeyuan Wang; Dehong Gao; Lei Yi; Linbo Jin; Jinxia Zhang; Libin Yang; Xiaoyan Cai

arXiv:2412.10029·cs.CV·December 16, 2024

TL;DR

This paper introduces Negative Augmented Samples (NAS), a novel pretraining approach that improves fine-grained vision-language understanding by generating challenging negative samples and using a visual dictionary to bridge modalities.

Contribution

The paper proposes NAS, a new method that enhances fine-grained perception in vision-language models through negative sample augmentation and a visual dictionary for better cross-modal alignment.
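
To make the idea of negative sample augmentation concrete, here is a minimal sketch of a generic contrastive (InfoNCE-style) loss for a single image-text pair, in which extra "augmented" negatives are appended to the usual in-batch negatives. This is not the authors' implementation: the summary above does not say how NAS constructs its negatives or how the visual dictionary enters the loss, so the function name, the argument names, and the temperature value are all assumptions chosen for illustration.

```python
import math


def contrastive_loss_with_extra_negatives(
    sim_pos, sim_batch_negs, sim_augmented_negs, temperature=0.07
):
    """InfoNCE-style loss for one image-text pair (illustrative only).

    sim_pos:            similarity of the matched image-text pair
    sim_batch_negs:     similarities to ordinary in-batch negatives
    sim_augmented_negs: similarities to extra, harder negatives
                        (hypothetical stand-in for augmented samples)
    temperature:        assumed softmax temperature, not from the paper
    """
    # The positive goes first so its scaled logit is scaled[0].
    sims = [sim_pos] + list(sim_batch_negs) + list(sim_augmented_negs)
    scaled = [s / temperature for s in sims]

    # Numerically stable log-sum-exp over all candidates.
    m = max(scaled)
    log_denominator = m + math.log(sum(math.exp(s - m) for s in scaled))

    # Negative log-probability of picking the positive.
    return log_denominator - scaled[0]


# Same positive and in-batch negatives, with and without a hard negative.
base_loss = contrastive_loss_with_extra_negatives(0.8, [0.1, 0.2], [])
hard_loss = contrastive_loss_with_extra_negatives(0.8, [0.1, 0.2], [0.75])
```

Under these assumptions, adding a negative whose similarity is close to the positive's can only raise the loss, which is the generic sense in which harder negatives give the model a stronger training signal.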

Findings

01. NAS significantly improves fine-grained task performance.

02. Negative visual augmentation creates more challenging training samples.

03. Experiments validate the effectiveness of NAS components.

Abstract

Existing Vision-Language Pretraining (VLP) methods have achieved remarkable improvements across a variety of vision-language tasks, confirming their effectiveness in capturing coarse-grained semantic correlations. However, their capability for fine-grained understanding, which is critical for many nuanced vision-language applications, remains limited. Prevailing VLP models often overlook the intricate distinctions in expressing different modal features and typically depend on the similarity of holistic features for cross-modal interactions. Moreover, these models directly align and integrate features from different modalities, focusing more on coarse-grained general representations, thus failing to capture the nuanced differences necessary for tasks demanding a more detailed perception. In response to these limitations, we introduce Negative Augmented Samples (NAS), a refined…


Taxonomy

Methods: ALIGN