Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval

Yibo Yan; Mingdong Ou; Yi Cao; Jiahao Huo; Xin Zou; Shuliang Liu; James Kwok; Xuming Hu

arXiv:2604.10167 · cs.CV · April 14, 2026

TL;DR

This paper introduces ColChunk, a framework for efficient visual document retrieval that hierarchically clusters patch-level embeddings, cutting storage by over 90% while improving nDCG@5 by 9 points on average.

Contribution

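ColChunk is a plug-and-play late chunking method for multi-vector models. It groups patch-level embeddings by hierarchical clustering fused with a 2D position prior, so that each resulting vector is spatially and semantically coherent, which improves both efficiency and accuracy.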

Findings

01. Achieves over a 90% reduction in storage requirements.

02. Delivers a 9-point average improvement in nDCG@5.

03. Effective across 24 diverse VDR datasets.

Abstract

Multi-vector models dominate Visual Document Retrieval (VDR) due to their fine-grained matching capabilities, but their high storage and computational costs present a major barrier to practical deployment. In this paper, we propose ColChunk, a plug-and-play framework that introduces multimodal late chunking to construct efficient, contextualized multi-vectors. Unlike existing pruning or fixed-token approaches, ColChunk employs hierarchical clustering on patch-level embeddings, fused with a 2D position prior to ensure spatial-semantic coherence. This adaptive grouping allows for a content-aware representation that preserves global context while drastically reducing the vector count. Evaluations across 24 VDR datasets demonstrate ColChunk achieves over a 90% reduction in storage requirements while simultaneously delivering a 9-point average improvement in nDCG@5 across representative…
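
To make the grouping step in the abstract concrete, here is a minimal sketch of position-aware hierarchical clustering over patch embeddings, followed by pooling each cluster into one vector. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify the linkage criterion, how the 2D position prior is fused, or how clusters are pooled, so average linkage, an additive fusion with a hypothetical weight `pos_weight`, and mean pooling are all stand-ins. The function names (`chunk_patches`, `patch_distance`) and the row-major grid layout are likewise hypothetical. The input embeddings are assumed to have already been produced by the full model over the whole page, which is what makes the chunking "late".

```python
import math

def cosine_distance(u, v):
    # 1 - cosine similarity; assumes both vectors are non-zero.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def patch_distance(i, j, embs, grid_w, pos_weight):
    # Assumed fusion rule: semantic distance plus a weighted Euclidean
    # distance between the patches' (row, col) positions on the page grid.
    row_i, col_i = divmod(i, grid_w)
    row_j, col_j = divmod(j, grid_w)
    spatial = math.hypot(row_i - row_j, col_i - col_j)
    return cosine_distance(embs[i], embs[j]) + pos_weight * spatial

def chunk_patches(embs, grid_w, n_chunks, pos_weight=0.1):
    """Merge patches bottom-up (average linkage) until n_chunks remain,
    then mean-pool each cluster into a single vector."""
    n = len(embs)
    dist = [[patch_distance(i, j, embs, grid_w, pos_weight) for j in range(n)]
            for i in range(n)]
    clusters = [[i] for i in range(n)]
    while len(clusters) > n_chunks:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                total = sum(dist[i][j] for i in clusters[a] for j in clusters[b])
                avg = total / (len(clusters[a]) * len(clusters[b]))
                if best is None or avg < best[0]:
                    best = (avg, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    dim = len(embs[0])
    pooled = [[sum(embs[i][k] for i in c) / len(c) for k in range(dim)]
              for c in clusters]
    return clusters, pooled

# Toy page: a 2x4 patch grid whose left and right halves differ semantically.
toy_embs = [[1.0, 0.0] if (i % 4) < 2 else [0.0, 1.0] for i in range(8)]
toy_clusters, toy_vectors = chunk_patches(toy_embs, grid_w=4, n_chunks=2)
```

In this sketch the index stores one pooled vector per cluster instead of one per patch, so the storage saving is simply the ratio of `n_chunks` to the original patch count. The naive merge loop is cubic or worse in the number of patches and is only meant for reading; a real pipeline would use an optimized hierarchical clustering routine.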
