TL;DR
This paper introduces Negative Augmented Samples (NAS), a novel pretraining approach that improves fine-grained vision-language understanding by generating challenging negative samples and using a visual dictionary to bridge modalities.
Contribution
The paper proposes NAS, a new method that enhances fine-grained perception in vision-language models through negative sample augmentation and a visual dictionary for better cross-modal alignment.
Findings
NAS significantly improves fine-grained task performance.
Negative visual augmentation creates more challenging training samples.
Experiments validate the effectiveness of NAS components.
Abstract
Existing Vision-Language Pretraining (VLP) methods have achieved remarkable improvements across a variety of vision-language tasks, confirming their effectiveness in capturing coarse-grained semantic correlations. However, their capability for fine-grained understanding, which is critical for many nuanced vision-language applications, remains limited. Prevailing VLP models often overlook the intricate distinctions in expressing different modal features and typically depend on the similarity of holistic features for cross-modal interactions. Moreover, these models directly align and integrate features from different modalities, focusing more on coarse-grained general representations, thus failing to capture the nuanced differences necessary for tasks demanding a more detailed perception. In response to these limitations, we introduce Negative Augmented Samples (NAS), a refined…
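To make the phrase "similarity of holistic features" concrete, the following is a minimal sketch of the kind of global-embedding contrastive objective that the abstract describes as coarse-grained. It is not the NAS method, whose details are not given in this excerpt. The function names, the toy two-dimensional embeddings, and the temperature of 0.07 are illustrative assumptions.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity_matrix(image_embs, text_embs):
    # One holistic (global) embedding per image and per caption.
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def cross_entropy_on_diagonal(logits):
    # Mean of -log softmax(row)[k] for row k, i.e. the matched pair is the target.
    total = 0.0
    for k, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[k]
    return total / len(logits)

def holistic_contrastive_loss(image_embs, text_embs, temperature=0.07):
    # Symmetric image-to-text and text-to-image contrastive loss.
    # temperature=0.07 is an assumed value, not taken from the paper.
    sim = cosine_similarity_matrix(image_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits_t))

# Toy batch: pair k of images matches pair k of captions.
aligned = holistic_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                    [[1.0, 0.0], [0.0, 1.0]])
# Same embeddings, but captions swapped so the matched pairs disagree.
swapped = holistic_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                    [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

Because each image and caption is reduced to a single vector, two captions that differ only in a small detail can receive nearly identical scores, which is the limitation the abstract attributes to holistic alignment.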
Taxonomy
Methods: ALIGN
