TL;DR
SLQ is a parameter-efficient framework that adapts multimodal large language models (MLLMs) for retrieval by appending shared latent queries to the input while keeping the backbone frozen. This preserves the pre-trained model's knowledge, and SLQ outperforms invasive methods such as full fine-tuning and LoRA on COCO and Flickr30K.
Contribution
Introduces SLQ, a novel non-invasive tuning method with shared latent queries for multimodal retrieval, and constructs KARR-Bench for knowledge-aware reasoning evaluation.
Findings
SLQ outperforms full fine-tuning and LoRA on COCO and Flickr30K datasets.
SLQ achieves competitive performance on MMEB.
SLQ yields substantial gains on the KARR-Bench benchmark.
Abstract
Multimodal Large Language Models (MLLMs) possess intrinsic reasoning and world-knowledge capabilities, yet adapting them for dense retrieval remains challenging. Existing approaches rely on invasive parameter updates, such as full fine-tuning and LoRA, which may disrupt the pre-trained semantic space and impair the structured knowledge essential for reasoning. To address this, we propose SLQ, a parameter-efficient tuning framework that adapts MLLMs for retrieval while keeping the backbone entirely frozen. SLQ introduces a small set of Shared Latent Queries that are appended to both text and image tokens, leveraging the model's native causal attention to aggregate multimodal context into a unified embedding space. Furthermore, to better evaluate retrieval beyond superficial pattern matching, we construct KARR-Bench, a benchmark designed for knowledge-aware reasoning retrieval. Extensive…
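The mechanism described in the abstract (a shared set of latent queries appended after the text or image tokens, with causal attention letting those queries read the whole preceding context) can be illustrated with a toy sketch. This is not the authors' implementation. The single-head attention with identity projections stands in for a frozen backbone, and the names causal_attention and embed, the mean pooling over query outputs, and the L2 normalization are all assumptions made for illustration.

```python
import math
import random


def causal_attention(tokens):
    """Toy single-head causal self-attention with identity projections.

    Stand-in for one frozen backbone layer (an assumption, not the real MLLM).
    Position i attends only to positions 0..i.
    """
    dim = len(tokens[0])
    outputs = []
    for i, query in enumerate(tokens):
        scores = [
            sum(a * b for a, b in zip(query, tokens[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        outputs.append([
            sum(w * tokens[j][d] for j, w in enumerate(weights)) / total
            for d in range(dim)
        ])
    return outputs


def embed(input_tokens, latent_queries):
    """Append the shared latent queries and pool their output states.

    Mean pooling and L2 normalization are illustrative assumptions.
    """
    hidden = causal_attention(input_tokens + latent_queries)
    query_states = hidden[len(input_tokens):]
    dim = len(query_states[0])
    pooled = [
        sum(state[d] for state in query_states) / len(query_states)
        for d in range(dim)
    ]
    norm = math.sqrt(sum(x * x for x in pooled)) or 1.0
    return [x / norm for x in pooled]


random.seed(0)
dim = 4
# The only trainable parameters in this sketch: one query set shared by both modalities.
latent_queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(2)]
text_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(3)]
image_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(5)]

text_emb = embed(text_tokens, latent_queries)
image_emb = embed(image_tokens, latent_queries)
similarity = sum(a * b for a, b in zip(text_emb, image_emb))  # cosine similarity
```

Because the queries sit at the end of the sequence, causal masking means the hidden states at the original token positions are identical with or without them. In this toy setting, that is the sense in which appending queries is non-invasive.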
