Visual Perturbation and Adaptive Hard Negative Contrastive Learning for Compositional Reasoning in Vision-Language Models

Xin Huang; Ruibin Li; Tong Jia; Wei Zheng; Ya Wang

arXiv:2505.15576 · cs.CV · August 29, 2025

TL;DR

This paper introduces Adaptive Hard Negative Perturbation Learning (AHNPL), a training method for vision-language models that improves compositional reasoning by generating image-based hard negatives and training on them with difficulty-aware contrastive learning.

Contribution

The paper translates text-based hard negatives into the visual domain and uses a dynamic margin loss to better discriminate challenging sample pairs.
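
The dynamic margin idea can be illustrated with a small sketch. This summary does not give the AHNPL loss formula, so everything below is an assumption made for illustration: the function names `cosine` and `dynamic_margin_loss`, the parameters `base_margin` and `scale`, and the rule that the margin grows linearly with the anchor-negative similarity are hypothetical and are not the authors' formulation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def dynamic_margin_loss(anchor, positive, negatives, base_margin=0.1, scale=0.2):
    """Hinge-style contrastive loss with a per-negative margin.

    Assumption (not from the paper): a negative that is more similar to the
    anchor counts as "harder" and is assigned a larger margin.
    """
    if not negatives:
        raise ValueError("at least one negative embedding is required")
    s_pos = cosine(anchor, positive)
    losses = []
    for negative in negatives:
        s_neg = cosine(anchor, negative)
        # The margin grows with difficulty; easy negatives keep the base margin.
        margin = base_margin + scale * max(0.0, s_neg)
        # Penalize the pair unless the positive beats the negative by the margin.
        losses.append(max(0.0, margin + s_neg - s_pos))
    return sum(losses) / len(losses)


# Toy 2-D embeddings: one easy negative (orthogonal) and one hard negative.
anchor = [1.0, 0.0]
positive = [1.0, 0.0]
easy_loss = dynamic_margin_loss(anchor, positive, [[0.0, 1.0]])
hard_loss = dynamic_margin_loss(anchor, positive, [[1.0, 0.0]])
```

In this toy setup the easy negative already sits outside its margin and contributes no loss, while the hard negative is pushed away with a larger margin. A fixed-margin loss would treat both negatives the same, which is the uniform treatment the abstract criticizes.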

Findings

01. Significant performance improvements on three public datasets.
02. Effective enhancement of visual encoder training.
03. Better discrimination of hard negative samples.

Abstract

Vision-Language Models (VLMs) are essential for multimodal tasks, especially compositional reasoning (CR) tasks, which require distinguishing fine-grained semantic differences between visual and textual embeddings. However, existing methods primarily fine-tune the model by generating text-based hard negative samples, neglecting the importance of image-based negative samples, which results in insufficient training of the visual encoder and ultimately impacts the overall performance of the model. Moreover, negative samples are typically treated uniformly, without considering their difficulty levels, and the alignment of positive samples is insufficient, which leads to challenges in aligning difficult sample pairs. To address these issues, we propose Adaptive Hard Negative Perturbation Learning (AHNPL). AHNPL translates text-based hard negatives into the visual domain to generate…

Code & Models

Repositories

nynu-bdai/ahnpl (Official)

Taxonomy

Topics: Multimodal Machine Learning Applications

Methods: Contrastive Learning