TL;DR
PLUME introduces a latent reasoning framework for universal multimodal embedding that replaces explicit chain-of-thought with continuous latent states, achieving faster inference and improved performance on complex multimodal tasks.
Contribution
PLUME proposes a latent reasoning approach that combines a semantic-anchor-guided transition adapter with a progressive explicit-to-latent training curriculum, and it outperforms explicit-CoT methods in both inference speed and accuracy.
Findings
Outperforms explicit-CoT UME baselines on the MMEB-v2 benchmark.
Cuts reasoning from hundreds of explicit steps to fewer than 10 latent steps.
Achieves over 30x faster inference in retrieval tasks.
Abstract
Universal multimodal embedding (UME) maps heterogeneous inputs into a shared retrieval space with a single model. Recent approaches improve UME by generating explicit chain-of-thought (CoT) rationales before extracting embeddings, enabling multimodal large language models to better infer complex query intent. However, explicit CoT incurs substantial inference overhead and can compress rich multimodal evidence into a narrow textual bottleneck. We propose PLUME, a latent reasoning framework that advances UME by replacing verbalized CoT with a short autoregressive rollout of continuous latent states. To support diverse multimodal queries, PLUME further introduces a semantic-anchor-guided transition adapter that steers latent rollout along different reasoning trajectories under the same fixed computation budget. To stabilize training, PLUME adopts a progressive explicit-to-latent curriculum…
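The latent rollout described in the abstract can be illustrated with a toy sketch. This is not the paper's implementation: the function `latent_rollout`, the weight matrices `W_step` and `W_anchor`, the softmax-based anchor selection, and the tanh transition are all hypothetical stand-ins chosen for illustration. The sketch only shows the general idea of feeding a continuous state back in for a small, fixed number of steps, with a set of anchor vectors steering each transition, and then using the final state as the embedding.

```python
import numpy as np

def latent_rollout(h0, anchors, W_step, W_anchor, num_steps=8):
    """Toy latent rollout: each continuous state is fed back as the next input.

    All names and the update rule are illustrative assumptions, not PLUME's API.
    """
    h = h0
    for _ in range(num_steps):  # fixed compute budget: always num_steps updates
        # Hypothetical adapter: soft-select anchors by similarity to the state.
        scores = anchors @ h
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()
        guide = weights @ anchors  # anchor mixture, shape (d,)
        # Stand-in transition: no tokens are decoded between steps.
        h = np.tanh(W_step @ h + W_anchor @ guide)
    # Use the L2-normalized final state as the retrieval embedding.
    return h / np.linalg.norm(h)

rng = np.random.default_rng(0)
d, num_anchors = 16, 4
h0 = rng.normal(size=d)                      # stand-in for the encoded query
anchors = rng.normal(size=(num_anchors, d))  # stand-in semantic anchors
W_step = rng.normal(size=(d, d)) / np.sqrt(d)
W_anchor = rng.normal(size=(d, d)) / np.sqrt(d)

embedding = latent_rollout(h0, anchors, W_step, W_anchor, num_steps=8)
```

In this sketch the cost per query depends only on `num_steps`, whereas explicit CoT decoding scales with the length of the generated rationale. Different queries weight the anchors differently, so they can follow different trajectories under the same step count.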
