From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model
Yeong-Joon Ju, Seong-Whan Lee

TL;DR
This paper introduces a data-efficient framework for multimodal embedding that leverages hierarchical prompts and a novel hard negative sampling technique, enabling zero-shot capabilities and competitive performance without extensive pre-training.
Contribution
It proposes a hierarchical embedding prompt and Self-aware Hard Negative Sampling (SaHa) to build robust multimodal embeddings efficiently without large-scale contrastive pre-training.
Findings
Achieves competitive results on the Massive Multimodal Embedding Benchmark.
Reduces data and computational requirements for multimodal embedding.
Enhances zero-shot embedding capabilities through task-level prompts.
Abstract
Adapting generative Multimodal Large Language Models (MLLMs) into universal embedding models typically demands resource-intensive contrastive pre-training, while traditional hard negative mining methods suffer from severe false negative contamination. In this paper, we propose a highly data-efficient framework that bypasses extensive pre-training to build a robust multimodal representation space. We first introduce a hierarchical embedding prompt that provides strong latent conditioning. By explicitly anchoring task definitions at the system level, this prompting strategy effectively bridges the modality gap and unlocks powerful zero-shot embedding capabilities. Building upon this latent conditioning, we present Self-aware Hard Negative Sampling (SaHa). Unlike conventional candidate-space mining, SaHa shifts the mechanism to the query-space by mapping retrieved candidates back to their…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Code & Models
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
Taxonomy
TopicsDomain Adaptation and Few-Shot Learning · Topic Modeling · Multimodal Machine Learning Applications
