Explore More, Learn Better: Parallel MLLM Embeddings under Mutual Information Minimization
Zhicheng Wang, Chen Ju, Xu Chen, Shuai Xiao, Jinsong Lan, Xiaoyong Zhu, Ying Chen, Zhiguo Cao

TL;DR
This paper introduces a parallel decoupling framework for multimodal embedding learning in MLLMs, employing mutual information minimization and contrastive supervision to generate diverse, robust, and semantically aligned embeddings efficiently.
Contribution
It proposes a novel Parallel Decoupling Framework (PDF) that leverages MLLMs' steerability to produce diverse embeddings through parallel paths with minimal computational overhead.
Findings
Achieves up to +8.9% improvement on the MMEB benchmark.
Significant gains across various model sizes and resolutions.
Enhanced embedding diversity and semantic coverage.
Abstract
Embedding models are a cornerstone of modern AI. Driven by Multimodal Large Language Models (MLLMs), they have made great progress in architecture and data curation, yet the prevailing paradigm is still limited to SSC, i.e., single input, singular embedding, contrastive supervision, which collapses rich, multifaceted inputs into monolithic embeddings and fails to fully exploit MLLM capabilities. In this paper, we propose a Parallel Decoupling Framework (PDF) for multimodal embedding learning that exploits the inherent steerability of MLLMs, i.e., their ability to flexibly generate markedly different responses under explicit instructions. Concretely, PDF conditions a shared MLLM backbone on distinct, learnable prefixes to roll out multiple parallel paths for one input, then uses these paths to obtain parallel embeddings. To promote full parallel diversity, we employ Mutual…
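The "shared backbone, distinct learnable prefixes, parallel embeddings" idea in the abstract can be illustrated with a toy sketch. Everything below is an assumption made for illustration: `toy_backbone` is a stand-in for the MLLM (mean pooling plus one tanh layer), the prefixes are random rather than learned, and they are prepended at the input only, which may differ from how the paper injects them.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if its norm is zero)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def toy_backbone(tokens, weight):
    """Hypothetical stand-in for the shared MLLM: mean-pool tokens, one tanh layer."""
    dim = len(tokens[0])
    pooled = [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]
    return [math.tanh(sum(w * p for w, p in zip(row, pooled))) for row in weight]


def parallel_embeddings(input_tokens, prefixes, weight):
    """Return one embedding per prefix; the backbone weights are shared by all paths."""
    return [
        l2_normalize(toy_backbone(prefix + input_tokens, weight))
        for prefix in prefixes
    ]


rng = random.Random(0)
DIM, NUM_PATHS, PREFIX_LEN = 8, 3, 2  # illustrative sizes, not from the paper

weight = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]
prefixes = [
    [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(PREFIX_LEN)]
    for _ in range(NUM_PATHS)
]
input_tokens = [[rng.gauss(0, 1) for _ in range(DIM)] for _ in range(5)]

embeddings = parallel_embeddings(input_tokens, prefixes, weight)
print(len(embeddings), "embeddings of dimension", len(embeddings[0]))
```

In this sketch the only per-path parameters are the prefix vectors, which is one way to read the claim of minimal computational overhead; a real implementation would presumably batch the paths rather than loop over them.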
Peer Reviews
Decision: Submitted to ICLR 2026
- Parallel prefix-conditioned paths + explicit MI minimization within MLLMs.
- Gains across tasks/scales, including compute-reduced settings.
- Clear framing (SSC→SPP) and conceptual diagrams.
- Practical recipe for better MLLM embeddings with negligible inference overhead.
- Possible instability/bias of vCLUB MI estimation; lack of analysis on estimator variance and its effect on training.
- Limited ablation on #paths, prefix depth/placement, and aggregation design.
- Generalization beyond MMEB (e.g., long-form retrieval/RAG latency-quality tradeoffs) not explored.
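For readers unfamiliar with the estimator this review refers to, the sampled CLUB upper bound on mutual information can be sketched as below. This is a minimal illustration, not the paper's implementation: the variational model `q(y | x)` is assumed to be a fixed-variance Gaussian, and `mean_fn` is a hypothetical stand-in for the network that vCLUB would fit by maximum likelihood.

```python
import math


def gaussian_log_prob(y, mean, var=1.0):
    """Log density of a diagonal Gaussian q(y | x) with one shared variance."""
    return sum(
        -0.5 * ((yi - mi) ** 2 / var + math.log(2 * math.pi * var))
        for yi, mi in zip(y, mean)
    )


def club_upper_bound(xs, ys, mean_fn, var=1.0):
    """Sampled CLUB estimate: E_{p(x,y)}[log q(y|x)] - E_{p(x)p(y)}[log q(y|x)]."""
    n = len(xs)
    means = [mean_fn(x) for x in xs]
    # Paired samples (x_i, y_i) approximate the joint distribution.
    positive = sum(gaussian_log_prob(ys[i], means[i], var) for i in range(n)) / n
    # All (x_i, y_j) combinations approximate the product of marginals.
    negative = sum(
        gaussian_log_prob(ys[j], means[i], var) for i in range(n) for j in range(n)
    ) / (n * n)
    return positive - negative


xs = [[0.0], [1.0], [2.0], [3.0]]

# y equals x and q predicts y from x: the estimate is positive.
dependent = club_upper_bound(xs, xs, mean_fn=lambda x: x)

# q ignores x entirely: both terms coincide and the estimate is zero.
uninformative = club_upper_bound(xs, xs, mean_fn=lambda x: [0.0])

print(dependent, uninformative)
```

The second call shows why the quality of the variational model matters: when `q` carries no information about `x`, the estimate is zero even though `x` and `y` are fully dependent, so the value being minimized is only as trustworthy as the fitted `q`.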
1. The deep prefix injection design is a clever extension of prefix-tuning that operates at every layer, rather than only at the input, offering a plausible mechanism to steer MLLMs toward richer, more diverse embedding spaces.
2. The authors present extensive benchmarks on multiple model scales and backbones.
1. Lack of evidence that deep prefix injection truly induces semantic diversity. There is no analysis of attention distributions, token contributions, or feature subspace diversity.
2. Table 3 shows nearly identical performance between Single Prefix and Aggregate inference strategies. If the learned embeddings are genuinely diverse, aggregation should bring at least a small improvement. This raises reasonable doubt that the parallel paths have collapsed or that diversity is not reflected in perf
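The collapse concern raised in this review can be made concrete with a simple diagnostic. The sketch below is an illustration only: the example vectors are made up, and `aggregate` assumes mean-then-normalize aggregation, which may not match the aggregation used in the paper.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)


def mean_pairwise_cosine(embeddings):
    """Average cosine similarity over all pairs of path embeddings."""
    pairs = list(combinations(embeddings, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def aggregate(embeddings):
    """Hypothetical aggregation: average the path embeddings, then L2-normalize."""
    dim = len(embeddings[0])
    mean = [sum(e[d] for e in embeddings) / len(embeddings) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in mean))
    return [x / norm for x in mean]


# Made-up path embeddings for one input.
collapsed = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
diverse = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]

for name, paths in (("collapsed", collapsed), ("diverse", diverse)):
    print(
        name,
        "mean pairwise cosine:", mean_pairwise_cosine(paths),
        "aggregate vs. first path:", cosine(aggregate(paths), paths[0]),
    )
```

When the paths coincide, the aggregated embedding is identical to any single path, so the two inference strategies cannot differ; a mean pairwise cosine well below 1 is a necessary (though not sufficient) sign that aggregation has something to add.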
The paper’s strengths span multiple dimensions. In terms of originality, it presents a fresh reformulation of multimodal embedding learning by replacing the conventional Single input–Singular embedding–Contrastive supervision (SSC) paradigm with the proposed Single input–Parallel paths–Parallel outputs (SPP) paradigm. This idea, realizing multiple, decorrelated embeddings via deep prefix injection and mutual information minimization, creatively leverages the inherent steerability of MLLMs, repre
1. Most experiments are conducted on the MMEB benchmark, with limited evaluation across other multimodal embedding datasets (e.g., MSCOCO, LAION-Aesthetics, or ImageNet-Text retrieval). This restricts the evidence for generalization and may introduce benchmark-specific bias. Adding results on diverse domains would better demonstrate robustness and transferability.
2. The paper proposes learnable prefixes for parallel embedding generation but provides little insight into how prefix dimensionalit
Taxonomy
Topics: Advanced Graph Neural Networks · Topic Modeling · Multimodal Machine Learning Applications
