MURE: Hierarchical Multi-Resolution Encoding via Vision-Language Models for Visual Document Retrieval
Fengbin Zhu, Zijing Cai, Yuzhe Wang, Pengyang Shao, Wenjie Wang, Fuli Feng, Richang Hong, and Tat-Seng Chua

TL;DR
This paper introduces MURE, a hierarchical multi-resolution encoding framework built on vision-language models for visual document retrieval. It aims to capture both fine-grained visual details and global document structure while keeping the number of visual tokens manageable.
Contribution
It proposes a novel hierarchical multi-resolution encoding framework with feature fusion and token compression, advancing visual document retrieval effectiveness and efficiency.
Findings
MURE outperforms strong baselines on VDR benchmarks.
It reduces the visual token budget by 50% while maintaining performance.
The framework effectively captures multi-scale visual cues.
Abstract
Visual Document Retrieval (VDR) requires representations that capture both fine-grained visual details and global document structure to ensure retrieval efficacy while maintaining computational efficiency. Existing VDR models struggle to balance effectiveness and efficiency when processing high-resolution documents: they often either lose fine-grained information or generate an excessive number of visual tokens, resulting in significant indexing overhead and high retrieval latency. In this work, we rethink the visual encoding mechanism and propose a new X-VisEmb paradigm that progresses from multi-resolution sampling and encoding, through cross-granularity feature fusion, to adaptive representation distillation. A preliminary study validates its feasibility and effectiveness in capturing complementary visual cues at varying scales. Building on the insights, we develop MURE, a novel…
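The three stages named in the abstract (multi-resolution sampling and encoding, cross-granularity feature fusion, and representation distillation) can be pictured with a toy sketch. This is not the authors' implementation. Every function name here is hypothetical, the "encoder" is a plain patch flattener standing in for a vision-language model, fusion is simple concatenation, and compression keeps the highest-norm tokens. These are illustrative assumptions only, chosen so the example runs with the Python standard library.

```python
from math import sqrt

def downsample(page, s):
    """Average-pool a 2D list of floats by an integer factor s (assumed sampling scheme)."""
    h, w = len(page) // s, len(page[0]) // s
    return [[sum(page[i * s + a][j * s + b] for a in range(s) for b in range(s)) / (s * s)
             for j in range(w)] for i in range(h)]

def encode_patches(view, patch=4):
    """Stand-in encoder: flatten each non-overlapping patch into one token vector."""
    h, w = len(view) // patch, len(view[0]) // patch
    return [[view[i * patch + a][j * patch + b] for a in range(patch) for b in range(patch)]
            for i in range(h) for j in range(w)]

def fuse(token_sets):
    """Stand-in fusion: concatenate the token lists from all resolutions."""
    return [tok for tokens in token_sets for tok in tokens]

def compress(tokens, keep_ratio=0.5):
    """Stand-in compression: keep the highest-norm tokens, preserving their order."""
    k = max(1, int(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)),
                    key=lambda t: -sqrt(sum(x * x for x in tokens[t])))
    return [tokens[t] for t in sorted(ranked[:k])]

# A synthetic 32x32 "page" with deterministic pixel values in [0, 1).
page = [[((i * 31 + j * 17) % 255) / 255 for j in range(32)] for i in range(32)]

views = [downsample(page, s) for s in (1, 2, 4)]    # fine, medium, coarse views
fused = fuse([encode_patches(v) for v in views])     # 64 + 16 + 4 tokens
compact = compress(fused, keep_ratio=0.5)            # half the token budget
print(len(fused), len(compact))
```

In this toy setup the three views yield 64, 16, and 4 tokens, and a keep ratio of 0.5 halves the fused set. The keep ratio only mirrors the 50% token-budget reduction reported in the Findings; it says nothing about how MURE selects or merges tokens.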
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Handwritten Text Recognition Techniques
