Beyond the Grid: Layout-Informed Multi-Vector Retrieval with Parsed Visual Document Representations

Yibo Yan; Mingdong Ou; Yi Cao; Xin Zou; Shuliang Liu; Jiahao Huo; Yu Huang; James Kwok; Xuming Hu

arXiv:2603.01666 · cs.CL · March 3, 2026

Open Access

TL;DR

This paper introduces ColParse, a layout-aware multi-vector retrieval method that significantly reduces storage needs and improves performance in visual document retrieval by leveraging parsed visual document representations.

Contribution

ColParse is a novel paradigm that uses document parsing to generate compact, layout-informed sub-image embeddings fused with global vectors for efficient retrieval.

Findings

01 Reduces storage by over 95%

02 Achieves significant performance improvements

03 Bridges gap between accuracy and scalability
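To make the storage figure concrete, here is a minimal arithmetic sketch. It assumes that storage scales linearly with the number of vectors kept per page, with embedding dimension and numeric precision unchanged. The function name and the vector counts are hypothetical and are not taken from the paper.

```python
def storage_reduction(vectors_before: int, vectors_after: int) -> float:
    """Fraction of per-page storage saved when the vector count shrinks.

    Assumes every vector has the same dimension and precision, so storage
    is proportional to the number of vectors (an illustrative assumption).
    """
    if vectors_before <= 0:
        raise ValueError("vectors_before must be positive")
    return 1.0 - vectors_after / vectors_before


# Hypothetical counts: 1024 patch vectors per page reduced to 32 vectors.
saved = storage_reduction(1024, 32)
print(f"{saved:.2%}")  # prints 96.88%
```

Under these assumed counts, any representation of about 51 vectors or fewer per page would clear the 95% mark.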

Abstract

Harnessing the full potential of visually-rich documents requires retrieval systems that understand not just text, but intricate layouts, a core challenge in Visual Document Retrieval (VDR). The prevailing multi-vector architectures, while powerful, face a crucial storage bottleneck that current optimization strategies, such as embedding merging, pruning, or using abstract tokens, fail to resolve without compromising performance or ignoring vital layout cues. To address this, we introduce ColParse, a novel paradigm that leverages a document parsing model to generate a small set of layout-informed sub-image embeddings, which are then fused with a global page-level vector to create a compact and structurally-aware multi-vector representation. Extensive experiments demonstrate that our method reduces storage requirements by over 95% while simultaneously yielding significant performance…
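The pipeline described in the abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the fusion rule, so the sketch assumes a weighted blend of each sub-image embedding with the global page vector followed by L2 normalization, and it assumes ColBERT-style MaxSim late-interaction scoring. The names `fuse`, `maxsim`, and `alpha` are hypothetical, and the embeddings are toy values standing in for the outputs of a parser and an encoder.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse(sub_embeddings, global_vec, alpha=0.5):
    """Blend each layout-region embedding with the page-level vector.

    Assumed fusion rule (not confirmed by the source):
    (1 - alpha) * region + alpha * global, then L2-normalize.
    """
    fused = []
    for region in sub_embeddings:
        mixed = [(1 - alpha) * r + alpha * g for r, g in zip(region, global_vec)]
        fused.append(l2_normalize(mixed))
    return fused


def maxsim(query_vecs, doc_vecs):
    """Late-interaction score: sum over query vectors of the best dot
    product against any document vector (assumed scoring function)."""
    return sum(
        max(sum(q_i * d_i for q_i, d_i in zip(q, d)) for d in doc_vecs)
        for q in query_vecs
    )


# Toy 2-D embeddings standing in for two parsed regions and one page vector.
regions = [[1.0, 0.0], [0.0, 1.0]]
page_vec = [1.0, 0.0]
doc = fuse(regions, page_vec, alpha=0.5)
score = maxsim([[1.0, 0.0]], doc)
```

The point of the sketch is the shape of the representation: a page is stored as one fused vector per parsed region rather than one vector per image patch, which is where the storage saving would come from.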

Taxonomy

Topics: Advanced Image and Video Retrieval Techniques · Multimodal Machine Learning Applications · Handwritten Text Recognition Techniques