A Picture is Worth a Thousand Words? An Empirical Study of Aggregation Strategies for Visual Financial Document Retrieval
Ho Hung Lim, Yi Yang

TL;DR
This study evaluates single-vector aggregation for visual financial document retrieval and finds that it loses significant semantic information, which poses risks in practical applications.
Contribution
It introduces a diagnostic benchmark and demonstrates that single-vector aggregation often collapses distinct documents into nearly identical vectors, putting retrieval accuracy at risk in financial contexts.
Findings
Single-vector aggregation obscures semantic differences between documents.
Patch-level analysis detects semantic changes that aggregation misses.
Global texture dominance causes information loss in aggregation.
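The three findings above can be illustrated with a toy sketch. This is not the paper's benchmark, models, or metrics: the patch embeddings are synthetic, mean pooling stands in for "single-vector aggregation", and a large component shared by every patch is an assumed stand-in for "global texture". All names (make_page, mean_pool, the patch index 17, the scales) are hypothetical and chosen only for illustration.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_pool(patches):
    """Aggregate patch vectors into one vector by averaging each dimension."""
    n = len(patches)
    return [sum(column) / n for column in zip(*patches)]


def make_page(rng, n_patches=256, dim=32, texture_scale=5.0):
    """Synthetic page: every patch = shared 'texture' vector + small local noise.

    Assumption: the shared component models page-wide layout and font texture
    that dominates each patch embedding.
    """
    texture = [texture_scale * rng.gauss(0, 1) for _ in range(dim)]
    return [[t + rng.gauss(0, 1) for t in texture] for _ in range(n_patches)]


rng = random.Random(0)  # fixed seed so the sketch is reproducible

page_a = make_page(rng)
page_b = [list(patch) for patch in page_a]

# Simulate a single-digit edit: perturb the local content of exactly one patch.
edited_patch = 17
page_b[edited_patch] = [x + 3.0 * rng.gauss(0, 1) for x in page_b[edited_patch]]

# Single-vector view: compare the two pooled page vectors.
pooled_sim = cosine(mean_pool(page_a), mean_pool(page_b))

# Patch-level view: compare aligned patches and look for the least similar pair.
patch_sims = [cosine(a, b) for a, b in zip(page_a, page_b)]
min_patch_sim = min(patch_sims)
changed_idx = patch_sims.index(min_patch_sim)

print(f"pooled cosine similarity:    {pooled_sim:.6f}")
print(f"lowest patch-level similarity: {min_patch_sim:.6f} (patch {changed_idx})")
```

In this toy setting the pooled vectors stay nearly identical because one edited patch is averaged with 255 unchanged ones, while the aligned patch comparison singles out the edited patch. Real vision encoders and real pooling schemes may behave differently, so the sketch only conveys the mechanism, not its magnitude.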
Abstract
Visual RAG offers an alternative to traditional RAG: it treats documents as images and uses vision encoders to obtain vision patch tokens. However, hundreds of patch tokens per document create retrieval and storage challenges in a vector database, so practical deployment requires aggregating them into a single vector. This raises a critical question: does single-vector aggregation lose key information in financial documents? We develop a diagnostic benchmark using financial documents where changes in single digits can lead to significant semantic shifts. Our experiments show that single-vector aggregation collapses different documents into almost identical vectors. Patch-level metrics detect the semantic changes, confirming that aggregation obscures these details. We identify global texture dominance as the root cause. Our findings are consistent across model scales,…
