Improve Multi-Modal Embedding Learning via Explicit Hard Negative Gradient Amplifying
Youze Xue, Dian Li, Gang Liu

TL;DR
This paper analyzes how hard negatives contribute to contrastive learning in multi-modal models and proposes an explicit gradient amplification method that improves embedding discriminability, achieving state-of-the-art results on the MMEB benchmark.
Contribution
It introduces a novel method to explicitly amplify hard negative gradients, enhancing multi-modal embedding learning beyond existing hard negative mining strategies.
Findings
Achieves state-of-the-art performance on the MMEB benchmark.
Reaches the top rank on the MMEB leaderboard when integrated with an MLLM.
Demonstrates the effectiveness of explicitly amplifying hard negative gradients.
Abstract
With the rapid advancement of multi-modal large language models (MLLMs) in recent years, the foundational Contrastive Language-Image Pretraining (CLIP) framework has been successfully extended to MLLMs, enabling more powerful and universal multi-modal embeddings for a wide range of retrieval tasks. Despite these developments, the core contrastive learning paradigm remains largely unchanged from CLIP-style models to MLLMs. Within this framework, the effective mining of hard negative samples continues to be a critical factor for enhancing performance. Prior works have introduced both offline and online strategies for hard negative mining to improve the efficiency of contrastive learning. While these approaches have led to improved multi-modal embeddings, the specific contribution of each hard negative sample to the learning process has not been thoroughly investigated. In this work, we…
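The abstract is cut off before it describes the method, so the following is only a minimal sketch of the general idea, not the authors' formulation. It relies on one standard fact about the InfoNCE (CLIP-style) loss: the gradient with respect to a negative's similarity score equals that negative's softmax probability divided by the temperature, so harder negatives (higher similarity to the query) already receive larger gradients. The function names, the similarity threshold `hard_threshold`, and the scaling factor `alpha` are hypothetical choices made for illustration.

```python
import math

def info_nce_grads(sim_pos, sim_negs, temperature=0.05):
    """InfoNCE loss for one query, plus its gradient w.r.t. each similarity.

    sim_pos  : similarity between the query and its positive target
    sim_negs : similarities between the query and each negative target
    """
    logits = [sim_pos / temperature] + [s / temperature for s in sim_negs]
    # Subtract the max logit for numerical stability before exponentiating.
    max_logit = max(logits)
    exps = [math.exp(z - max_logit) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    loss = -math.log(probs[0])
    # d(loss)/d(sim_pos) = (p_pos - 1) / temperature  (non-positive)
    grad_pos = (probs[0] - 1.0) / temperature
    # d(loss)/d(sim_neg_j) = p_j / temperature        (non-negative)
    grad_negs = [p / temperature for p in probs[1:]]
    return loss, grad_pos, grad_negs

def amplify_hard_negative_grads(grad_negs, sim_negs, hard_threshold=0.5, alpha=2.0):
    """Hypothetical amplification rule (an assumption, not from the paper):
    scale the gradient of any negative whose similarity reaches the threshold."""
    return [
        g * alpha if s >= hard_threshold else g
        for g, s in zip(grad_negs, sim_negs)
    ]

# One hard negative (0.7) and two easy negatives (0.1, -0.2).
sim_negs = [0.7, 0.1, -0.2]
loss, grad_pos, grad_negs = info_nce_grads(0.8, sim_negs, temperature=0.1)
amplified = amplify_hard_negative_grads(grad_negs, sim_negs)
```

In this sketch the hard negative dominates the gradient even before amplification, and the amplification step only rescales the entries selected by the threshold. In an actual training loop the same scaling would have to be applied inside the backward pass of the similarity matrix, and how the paper selects and weights hard negatives is not recoverable from the truncated abstract.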
Taxonomy
Topics: Machine Learning and ELM · Domain Adaptation and Few-Shot Learning
