MINER: Mining Multimodal Internal Representation for Efficient Retrieval
Weien Li, Rui Song, Zeyu Li, Haochen Liu, Gonghao Zhang, Difan Jiao, Zhenwei Tang, Bowei He, Haolun Wu, Xue Liu, Ye Yuan

TL;DR
MINER improves dense single-vector document retrieval by drawing on signals from internal transformer layers, raising retrieval quality without increasing storage or latency.
Contribution
The paper proposes MINER, a lightweight plug-in that probes and fuses internal transformer representations to boost retrieval performance while maintaining efficiency.
Findings
MINER outperforms existing dense single-vector retrievers on multiple benchmarks.
MINER narrows the gap between dense and late-interaction retrieval methods.
MINER improves nDCG@5 by up to 4.5% over its backbone models.
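For readers unfamiliar with the metric, nDCG@5 can be sketched as follows. This is a minimal illustration, not the paper's evaluation code. It assumes the linear-gain form of DCG (some benchmarks use an exponential gain, which gives the same result for binary relevance labels) and assumes the input list holds the relevance labels of all judged documents in ranked order. The function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    # rank is 0-based, so the discount is log2(rank + 2): 1 for the top hit.
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=5):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# One relevant page, ranked first versus ranked second.
best = ndcg_at_k([1, 0, 0, 0, 0])
second = ndcg_at_k([0, 1, 0, 0, 0])
```

A relevant page at rank 1 yields a score of 1.0, and the score decays logarithmically as the page moves down the ranking.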
Abstract
Visual document retrieval has become essential for accessing information in visually rich documents. Existing approaches fall into two camps. Late-interaction retrievers achieve strong quality through fine-grained token-level matching but store hundreds of vectors per page, incurring large index footprints and high serving costs. By contrast, dense single-vector retrievers retain storage and latency advantages but consistently lag in quality because they compress all information into a single final-layer embedding. In this work, we first conduct a layerwise diagnostic on single-vector retrievers, revealing that retrieval-relevant signal resides in internal representations. Motivated by these findings, we propose MINER (Mining Multimodal Internal RepreseNtation for Efficient Retrieval), a lightweight plug-in module that probes and fuses internal signals across transformer layers into a…
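To make the idea of fusing internal layers into one embedding concrete, the sketch below combines mean-pooled hidden states from several layers using softmax-normalized per-layer weights, then L2-normalizes the result so that a page is still indexed as a single vector. This is an assumption-laden illustration, not MINER's actual architecture: the abstract above does not specify the probing or fusion mechanism, and the weighted-sum scheme, the `layer_logits` parameter, and all function names are hypothetical.

```python
import math


def mean_pool(token_states):
    """Average a [tokens x dim] matrix (list of lists) into one dim-sized vector."""
    n = len(token_states)
    dim = len(token_states[0])
    return [sum(tok[d] for tok in token_states) / n for d in range(dim)]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def fuse_layers(layer_states, layer_logits):
    """Fuse per-layer hidden states into one unit-length embedding.

    layer_states: one [tokens x dim] matrix per selected layer.
    layer_logits: one score per layer (hypothetical learned parameters).
    """
    weights = softmax(layer_logits)
    pooled = [mean_pool(states) for states in layer_states]
    dim = len(pooled[0])
    fused = [sum(w * p[d] for w, p in zip(weights, pooled)) for d in range(dim)]
    return l2_normalize(fused)


# Toy input: two layers, two tokens each, embedding dimension 2.
states = [
    [[1.0, 0.0], [1.0, 0.0]],  # an intermediate layer
    [[0.0, 1.0], [0.0, 1.0]],  # the final layer
]
page_vector = fuse_layers(states, layer_logits=[0.0, 0.0])
```

Because the output has the same dimension as a single-layer embedding, the index size and the query-time dot product are the same as for a final-layer-only retriever; only the encoder-side computation changes.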
