Think When Needed: Adaptive Reasoning-Driven Multimodal Embeddings with a Dual-LoRA Architecture

Longxiang Zhang; Weilong Dai; Guanghao Zhang; Hao Jiang; Pipei Huang

arXiv:2605.14448·cs.CV·May 15, 2026



TL;DR

TWN (Think When Needed) is a multimodal embedding framework that generates chain-of-thought reasoning only for inputs that need it, improving retrieval quality and inference efficiency while adding only 3-5% parameters on top of the backbone.

Contribution

TWN introduces a dual-LoRA architecture, with reasoning and embedding adapters attached to a shared frozen backbone, and an adaptive routing mechanism that generates reasoning only when necessary, reducing inference cost and improving performance.

Findings

01 Achieves state-of-the-art embedding quality on MMEB-V2 tasks.

02 Requires only 3-5% additional parameters relative to the backbone.

03 Reduces reasoning tokens by up to 50% compared to full generative methods.
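
The parameter overhead in finding 02 can be made concrete with the standard LoRA count: an adapter of rank r on a weight of shape d_out x d_in adds r * (d_in + d_out) parameters. The sketch below is a back-of-the-envelope calculation only. The hidden size, layer count, number of adapted matrices per layer, rank, and backbone size are illustrative assumptions, not values reported for TWN; they are chosen merely to show how an overhead of a few percent can arise with two adapters.

```python
def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by one LoRA adapter: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


def dual_lora_overhead(hidden: int, n_layers: int, matrices_per_layer: int,
                       rank: int, backbone_params: float, n_adapters: int = 2):
    """Return (added parameters, fraction of backbone) for n_adapters LoRA sets.

    Assumes every adapted matrix is square (hidden x hidden), which is a
    simplification made for this illustration.
    """
    per_adapter = n_layers * matrices_per_layer * lora_param_count(hidden, hidden, rank)
    added = n_adapters * per_adapter
    return added, added / backbone_params


# Hypothetical configuration (NOT taken from the paper).
added, fraction = dual_lora_overhead(
    hidden=4096, n_layers=32, matrices_per_layer=4, rank=128, backbone_params=7e9
)
print(f"added parameters: {added:,}")
print(f"overhead relative to backbone: {fraction:.2%}")
```

Because the count grows linearly in the rank and in the number of adapted matrices, the overhead can be tuned by adjusting either one.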

Abstract

Multimodal large language models (MLLMs) have emerged as a powerful backbone for multimodal embeddings. Recent methods introduce chain-of-thought (CoT) reasoning into the embedding pipeline to improve retrieval quality, but they remain costly in both model size and inference. They typically employ a separate reasoner and embedder, which adds substantial parameter overhead, and they generate CoT indiscriminately for every input. However, we observe that for simple inputs, discriminative embeddings already perform well, and redundant reasoning can even mislead the model and degrade performance. To address these limitations, we propose Think When Needed (TWN), a unified multimodal embedding framework with adaptive reasoning. TWN introduces a dual-LoRA architecture that attaches reasoning and embedding adapters to a shared frozen backbone, detaching gradients at their interface to mitigate gradient…
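
As a rough illustration of the adaptive idea described in the abstract, the following sketch routes each input either to direct embedding or to reasoning followed by embedding. The router score, the threshold, and all `toy_*` functions are hypothetical stand-ins rather than TWN's actual components; the abstract does not specify how the routing decision is made, and no model is involved here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RoutedResult:
    used_reasoning: bool
    reasoning_tokens: int
    embedding: List[float]


def embed_adaptively(query: str,
                     router_score: Callable[[str], float],
                     generate_reasoning: Callable[[str], str],
                     embed: Callable[[str], List[float]],
                     threshold: float = 0.5) -> RoutedResult:
    """Generate reasoning only when the router score reaches the threshold."""
    if router_score(query) >= threshold:
        reasoning = generate_reasoning(query)
        n_tokens = len(reasoning.split())  # whitespace tokens, for illustration
        return RoutedResult(True, n_tokens, embed(query + " " + reasoning))
    # Simple input: skip generation and embed the query directly.
    return RoutedResult(False, 0, embed(query))


# --- Toy stand-ins (assumptions for this sketch only) ---
def toy_router_score(query: str) -> float:
    # Hypothetical complexity proxy: longer queries score higher.
    return min(1.0, len(query.split()) / 10)


def toy_generate_reasoning(query: str) -> str:
    # Placeholder for a reasoning adapter's generated text.
    return "think: " + " ".join(reversed(query.split()))


def toy_embed(text: str) -> List[float]:
    # Placeholder for an embedding adapter: [word count, character count].
    words = text.split()
    return [float(len(words)), float(sum(len(w) for w in words))]


queries = [
    "red car",
    "find the image where the second person from the left holds an umbrella",
]
for q in queries:
    result = embed_adaptively(q, toy_router_score, toy_generate_reasoning, toy_embed)
    print(q, "->", result.used_reasoning, result.reasoning_tokens)
```

In this sketch, inputs routed to the direct path incur no generation at all, which is where token savings relative to always generating reasoning would come from.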


Code & Models


Datasets

zhanglx/TWN-training-data (dataset)
