MINER: Mining Multimodal Internal Representation for Efficient Retrieval

Weien Li; Rui Song; Zeyu Li; Haochen Liu; Gonghao Zhang; Difan Jiao; Zhenwei Tang; Bowei He; Haolun Wu; Xue Liu; Ye Yuan

arXiv:2605.06460·cs.LG·May 8, 2026

TL;DR

MINER improves dense single-vector document retrieval by fusing signals from internal transformer layers, raising retrieval quality without increasing storage or latency.

Contribution

The paper proposes MINER, a lightweight plug-in that probes and fuses internal transformer representations to boost retrieval performance while maintaining efficiency.

Findings

01

MINER outperforms existing dense single-vector retrievers on multiple benchmarks.

02

MINER narrows the gap between dense and late-interaction retrieval methods.

03

MINER improves nDCG@5 by up to 4.5% over its backbone models.
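
For readers unfamiliar with the metric in the last finding, here is a minimal sketch of how nDCG@5 is commonly computed for one query. It assumes linear gain and a log2 rank discount, and it takes the ideal ordering from the same relevance list; the paper's exact variant and evaluation protocol are not stated on this page, and the function names `dcg_at_k` and `ndcg_at_k` are illustrative.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k results (linear gain, log2 discount)."""
    # rank is 0-based, so the discount for the first result is log2(2) = 1.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=5):
    """Normalize DCG@k by the DCG@k of the best possible ordering of the same labels."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels of the pages a retriever returned, in ranked order.
ranked_relevance = [0, 1, 0, 0, 1]
score = ndcg_at_k(ranked_relevance, k=5)
```

A perfect ranking scores 1.0, so moving relevant pages toward the top of the first five results is what raises the metric.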

Abstract

Visual document retrieval has become essential for accessing information in visually rich documents. Existing approaches fall into two camps. Late-interaction retrievers achieve strong quality through fine-grained token-level matching but store hundreds of vectors per page, incurring large index footprints and high serving costs. By contrast, dense single-vector retrievers retain storage and latency advantages but consistently lag in quality because they compress all information into a single final-layer embedding. In this work, we first conduct a layerwise diagnostic on single-vector retrievers, revealing that retrieval-relevant signal resides in internal representations. Motivated by these findings, we propose MINER (Mining Multimodal Internal RepreseNtation for Efficient Retrieval), a lightweight plug-in module that probes and fuses internal signals across transformer layers into a…
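
The abstract above is cut off before it describes how MINER fuses layers, so the following is only a generic illustration of the idea of combining internal representations into one vector, not the authors' mechanism. It assumes mean pooling over tokens, a softmax over hypothetical per-layer scores (`layer_logits`), and L2 normalization; all names are invented for this sketch.

```python
import math

def mean_pool(token_states):
    """Average a layer's token vectors into one vector."""
    n, dim = len(token_states), len(token_states[0])
    return [sum(tok[d] for tok in token_states) / n for d in range(dim)]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_layers(layer_states, layer_logits):
    """Weighted sum of per-layer pooled vectors, L2-normalized to a single embedding.

    layer_states: one entry per layer, each a list of token vectors.
    layer_logits: hypothetical per-layer scores (assumed learned elsewhere).
    """
    weights = softmax(layer_logits)
    pooled = [mean_pool(states) for states in layer_states]
    dim = len(pooled[0])
    fused = [sum(w * vec[d] for w, vec in zip(weights, pooled)) for d in range(dim)]
    norm = math.sqrt(sum(x * x for x in fused)) or 1.0  # guard against a zero vector
    return [x / norm for x in fused]

# Toy input: two layers, two tokens each, two-dimensional hidden states.
toy_layers = [
    [[1.0, 0.0], [1.0, 0.0]],
    [[0.0, 1.0], [0.0, 1.0]],
]
embedding = fuse_layers(toy_layers, layer_logits=[0.0, 0.0])
```

Whatever the fusion rule, the output stays one fixed-size vector per page, which is why this family of approaches leaves index size unchanged relative to a final-layer embedding.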
