From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model

Yeong-Joon Ju; Seong-Whan Lee

arXiv:2508.00955·cs.LG·March 2, 2026

From Generator to Embedder: Harnessing Innate Abilities of Multimodal LLMs via Building Zero-Shot Discriminative Embedding Model

Yeong-Joon Ju, Seong-Whan Lee

PDF

Open Access 2 Models

TL;DR

This paper introduces a data-efficient framework for multimodal embedding that leverages hierarchical prompts and a novel hard negative sampling technique, enabling zero-shot capabilities and competitive performance without extensive pre-training.

Contribution

It proposes a hierarchical embedding prompt and Self-aware Hard Negative Sampling (SaHa) to build robust multimodal embeddings efficiently without large-scale contrastive pre-training.

Findings

01

Achieves competitive results on the Massive Multimodal Embedding Benchmark.

02

Reduces data and computational requirements for multimodal embedding.

03

Enhances zero-shot embedding capabilities through task-level prompts.

Abstract

Adapting generative Multimodal Large Language Models (MLLMs) into universal embedding models typically demands resource-intensive contrastive pre-training, while traditional hard negative mining methods suffer from severe false negative contamination. In this paper, we propose a highly data-efficient framework that bypasses extensive pre-training to build a robust multimodal representation space. We first introduce a hierarchical embedding prompt that provides strong latent conditioning. By explicitly anchoring task definitions at the system level, this prompting strategy effectively bridges the modality gap and unlocks powerful zero-shot embedding capabilities. Building upon this latent conditioning, we present Self-aware Hard Negative Sampling (SaHa). Unlike conventional candidate-space mining, SaHa shifts the mechanism to the query-space by mapping retrieved candidates back to their…

Peer Reviews

No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.

Code & Models

Models

Videos

No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.

Taxonomy

TopicsDomain Adaptation and Few-Shot Learning · Topic Modeling · Multimodal Machine Learning Applications