Explore More, Learn Better: Parallel MLLM Embeddings under Mutual Information Minimization

Zhicheng Wang; Chen Ju; Xu Chen; Shuai Xiao; Jinsong Lan; Xiaoyong Zhu; Ying Chen; Zhiguo Cao

arXiv:2511.01588 · cs.LG · November 24, 2025



TL;DR

This paper introduces a parallel decoupling framework for multimodal embedding learning in MLLMs, employing mutual information minimization and contrastive supervision to generate diverse, robust, and semantically aligned embeddings efficiently.

Contribution

It proposes a novel Parallel Decoupling Framework (PDF) that leverages MLLMs' steerability to produce diverse embeddings through parallel paths with minimal computational overhead.

Findings

1. Achieved up to +8.9% improvement on the MMEB benchmark.
2. Significant gains across various model sizes and resolutions.
3. Enhanced embedding diversity and semantic coverage.

Abstract

Embedding models are a cornerstone of modern AI. Driven by Multimodal Large Language Models (MLLMs), they have made great progress in architecture and data curation, yet the overall paradigm is still limited to SSC, i.e., single input, singular embedding, contrastive supervision, which collapses rich, multifaceted inputs into monolithic embeddings and fails to fully exploit MLLM capabilities. In this paper, we propose a Parallel Decoupling Framework (PDF) for multimodal embedding learning that exploits the inherent steerability of MLLMs, i.e., their ability to flexibly generate markedly different responses under explicit instructions. Concretely, PDF conditions a shared MLLM backbone on distinct, learnable prefixes to roll out multiple parallel paths for one input, then relies on these paths to obtain parallel embeddings. To promote full parallel diversity, we employ Mutual…
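As a rough illustration of the mechanism the abstract describes, the sketch below conditions one shared function on several prefixes to obtain several embeddings for a single input, then scores a batch with a standard InfoNCE contrastive loss. Everything in it is an assumption made for illustration: `toy_backbone` is a hypothetical stand-in for an MLLM, the prefixes are fixed random vectors rather than learned ones, and the paper's actual prefix injection and loss formulation are not reproduced on this page.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def toy_backbone(prefix, features, weights):
    """Hypothetical shared backbone: the prefix shifts the input, then one
    shared linear map and a tanh produce a unit-length embedding."""
    shifted = [f + p for f, p in zip(features, prefix)]
    hidden = [math.tanh(dot(row, shifted)) for row in weights]
    return l2_normalize(hidden)


def parallel_embeddings(features, prefixes, weights):
    """One embedding per prefix; the backbone weights are shared by all paths."""
    return [toy_backbone(p, features, weights) for p in prefixes]


def info_nce(queries, targets, temperature=0.1):
    """Mean InfoNCE loss where queries[i] is the positive match of targets[i]."""
    total = 0.0
    for i, query in enumerate(queries):
        logits = [dot(query, t) / temperature for t in targets]
        peak = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(queries)


rng = random.Random(0)
dim, num_paths = 4, 3
# Assumption: random values stand in for trained weights and learned prefixes.
weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(dim)]
prefixes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_paths)]

sample = [0.5, -1.0, 0.25, 2.0]
embeddings = parallel_embeddings(sample, prefixes, weights)
print(len(embeddings), "parallel embeddings of dimension", len(embeddings[0]))
```

In a trained system each path would contribute its own contrastive term, and a separate diversity objective would discourage the paths from producing the same embedding. Here the paths differ only because the random prefixes differ.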

Peer Reviews

Decision · Submitted to ICLR 2026

Reviewer 01 · Rating 6 · Confidence 3

Strengths

- Parallel prefix-conditioned paths + explicit MI minimization within MLLMs.
- Gains across tasks/scales, including compute-reduced settings.
- Clear framing (SSC→SPP) and conceptual diagrams.
- Practical recipe for better MLLM embeddings with negligible inference overhead.

Weaknesses

- Possible instability/bias of vCLUB MI estimation; lack of analysis on estimator variance and its effect on training.
- Limited ablation on #paths, prefix depth/placement, and aggregation design.
- Generalization beyond MMEB (e.g., long-form retrieval/RAG latency-quality tradeoffs) not explored.
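For readers unfamiliar with the estimator named in the first weakness, the following is a minimal sketch of a CLUB-style sample estimate of mutual information, the quantity that vCLUB computes with a learned conditional q(y|x). The unit-variance Gaussian form and the `predict_mean` callable are assumptions for illustration; the paper's own estimator and its variational network are not reproduced on this page.

```python
def gaussian_log_q(y, mean):
    """log q(y | x) for a unit-variance Gaussian, up to an additive constant
    that cancels in the estimate below."""
    return -0.5 * sum((a - b) ** 2 for a, b in zip(y, mean))


def club_estimate(xs, ys, predict_mean):
    """CLUB-style estimate: mean log q over paired samples (x_i, y_i)
    minus mean log q over all pairs (x_i, y_j)."""
    n = len(xs)
    means = [predict_mean(x) for x in xs]
    paired = sum(gaussian_log_q(ys[i], means[i]) for i in range(n)) / n
    shuffled = sum(
        gaussian_log_q(ys[j], means[i]) for i in range(n) for j in range(n)
    ) / (n * n)
    return paired - shuffled


def identity(x):
    """Assumption: a fixed identity map stands in for the learned mean network."""
    return x


xs = [[0.0], [1.0], [2.0]]
dependent = club_estimate(xs, xs, identity)            # y fully determined by x
constant = club_estimate(xs, [[1.0]] * 3, identity)    # y carries no information
print(dependent, constant)
```

The estimate is large when the paired samples are much better explained than shuffled pairs and shrinks toward zero when pairing does not matter. Because it is a difference of two sample means that depend on a learned q, its variance and bias are exactly what this weakness asks the authors to analyze.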

Reviewer 02 · Rating 4 · Confidence 4

Strengths

1. The deep prefix injection design is a clever extension of prefix-tuning that operates at every layer, rather than only at the input, offering a plausible mechanism to steer MLLMs toward richer, more diverse embedding spaces.
2. The authors present extensive benchmarks on multiple model scales and backbones

Weaknesses

1. Lack of evidence that deep prefix injection truly induces semantic diversity. There is no analysis of attention distributions, token contributions, or feature subspace diversity.
2. Table 3 shows nearly identical performance between the Single Prefix and Aggregate inference strategies. If the learned embeddings were genuinely diverse, aggregation should bring at least a small improvement. This raises the concern that the parallel paths have collapsed or that diversity is not reflected in perf

Reviewer 03 · Rating 6 · Confidence 3

Strengths

The paper’s strengths span multiple dimensions. In terms of originality, it presents a fresh reformulation of multimodal embedding learning by replacing the conventional Single input–Singular embedding–Contrastive supervision (SSC) paradigm with the proposed Single input–Parallel paths–Parallel outputs (SPP) paradigm. This idea, realizing multiple, decorrelated embeddings via deep prefix injection and mutual information minimization, creatively leverages the inherent steerability of MLLMs, repre

Weaknesses

1. Most experiments are conducted on the MMEB benchmark, with limited evaluation across other multimodal embedding datasets (e.g., MSCOCO, LAION-Aesthetics, or ImageNet-Text retrieval). This restricts the evidence for generalization and may introduce benchmark-specific bias. Adding results on diverse domains would better demonstrate robustness and transferability.
2. The paper proposes learnable prefixes for parallel embedding generation but provides little insight into how prefix dimensionalit


Taxonomy

Topics: Advanced Graph Neural Networks · Topic Modeling · Multimodal Machine Learning Applications