Sculpting the Vector Space: Towards Efficient Multi-Vector Visual Document Retrieval via Prune-then-Merge Framework

Yibo Yan; Mingdong Ou; Yi Cao; Xin Zou; Jiahao Huo; Shuliang Liu; James Kwok; Xuming Hu

arXiv:2602.19549 · cs.CL · April 21, 2026

TL;DR

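This paper introduces Prune-then-Merge, a two-stage framework for multi-vector visual document retrieval. It first prunes low-information patch embeddings and then hierarchically merges the remainder, cutting the overhead of multi-vector indexes while preserving semantic fidelity. The authors report that it outperforms existing pruning and merging methods on 29 datasets.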

Contribution

The paper proposes a novel two-stage Prune-then-Merge framework that enhances multi-vector visual document retrieval efficiency and compression while maintaining high semantic accuracy.

Findings

01 Outperforms existing methods on 29 datasets

02 Extends near-lossless compression range

03 Maintains robust performance at high compression ratios

Abstract

Visual Document Retrieval (VDR), which aims to retrieve relevant pages from vast corpora of visually rich documents, is important in current multimodal retrieval applications. The state-of-the-art multi-vector paradigm excels in performance but suffers from prohibitive overhead, a problem that current efficiency methods such as pruning and merging address only imperfectly, creating a difficult trade-off between compression rate and feature fidelity. To overcome this dilemma, we introduce Prune-then-Merge, a novel two-stage framework that combines these complementary approaches. Our method first employs an adaptive pruning stage to filter out low-information patches, creating a refined, high-signal set of embeddings. Subsequently, a hierarchical merging stage compresses this pre-filtered set, effectively summarizing semantic content without the noise-induced feature dilution seen in…
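To make the two stages concrete, here is a minimal sketch of a prune-then-merge pipeline over one page's patch embeddings. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify how patch importance is scored or how merging is performed, so the per-patch `scores` input, the `mean + k * std` threshold, and the greedy cosine-similarity agglomeration are all hypothetical choices, as are the function names.

```python
import numpy as np


def adaptive_prune(embeddings, scores, k=0.0):
    """Stage 1 (assumed rule): keep patches scoring above mean + k * std.

    `scores` is a hypothetical per-patch importance signal; the threshold
    adapts to each page because it is computed from that page's own scores.
    """
    threshold = scores.mean() + k * scores.std()
    keep = scores > threshold
    if not keep.any():  # never drop every patch; fall back to the best one
        keep[np.argmax(scores)] = True
    return embeddings[keep]


def hierarchical_merge(embeddings, target):
    """Stage 2 (assumed rule): greedy agglomerative merging.

    Repeatedly merges the two clusters whose centroids are most
    cosine-similar until `target` vectors remain, then returns the
    L2-normalised centroids.
    """
    target = max(1, target)
    # Each cluster is stored as (sum of member vectors, member count).
    clusters = [(v.astype(float), 1) for v in embeddings]
    while len(clusters) > target:
        cents = np.stack([s / n for s, n in clusters])
        cents /= np.linalg.norm(cents, axis=1, keepdims=True) + 1e-12
        sim = cents @ cents.T
        np.fill_diagonal(sim, -np.inf)  # ignore self-similarity
        i, j = np.unravel_index(np.argmax(sim), sim.shape)
        i, j = min(i, j), max(i, j)
        merged = (clusters[i][0] + clusters[j][0],
                  clusters[i][1] + clusters[j][1])
        clusters.pop(j)       # j > i, so index i is still valid
        clusters[i] = merged
    out = np.stack([s / n for s, n in clusters])
    return out / (np.linalg.norm(out, axis=1, keepdims=True) + 1e-12)


def prune_then_merge(embeddings, scores, target, k=0.0):
    """Filter low-scoring patches first, then compress what survives."""
    kept = adaptive_prune(embeddings, scores, k)
    return hierarchical_merge(kept, min(target, len(kept)))


# Usage: 32 synthetic patch embeddings of dimension 8, compressed to 4 vectors.
rng = np.random.default_rng(0)
patches = rng.normal(size=(32, 8))
importance = np.arange(32, dtype=float)  # stand-in importance scores
compressed = prune_then_merge(patches, importance, target=4)
```

Pruning before merging means the averaged centroids are computed only from patches judged informative, which is the intuition behind the abstract's remark about avoiding noise-induced feature dilution.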
