Evaluating VisualRAG: Quantifying Cross-Modal Performance in Enterprise Document Understanding

Varun Mannam, Fang Wang, and Xin Chen

arXiv:2506.21604 · cs.IR · June 30, 2025

Open Access

TL;DR

This paper introduces a quantitative benchmarking framework for assessing trustworthiness in multimodal VisualRAG systems, demonstrating how optimal modality weighting improves performance and reliability in enterprise document understanding.

Contribution

The paper presents a systematic framework for measuring trustworthiness in multimodal VisualRAG systems, linking technical metrics with user trust, and evaluates foundation models' impact on enterprise AI reliability.

Findings

01. Optimal modality weights (30% text, 15% image, 25% caption, 30% OCR) improve performance by 57.3% over text-only baselines.

02. The framework establishes quantitative relationships between technical metrics and user trust.

03. Foundation models differ significantly in trustworthiness for captioning and OCR tasks.

Abstract

Current evaluation frameworks for multimodal generative AI struggle to establish trustworthiness, hindering enterprise adoption where reliability is paramount. We introduce a systematic, quantitative benchmarking framework to measure the trustworthiness of progressively integrating cross-modal inputs such as text, images, captions, and OCR within VisualRAG systems for enterprise document intelligence. Our approach establishes quantitative relationships between technical metrics and user-centric trust measures. Evaluation reveals that optimal modality weighting with weights of 30% text, 15% image, 25% caption, and 30% OCR improves performance by 57.3% over text-only baselines while maintaining computational efficiency. We provide comparative assessments of foundation models, demonstrating their differential impact on trustworthiness in caption generation and OCR extraction, a vital…
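To make the reported modality weighting concrete, here is a minimal Python sketch of how weights of 30% text, 15% image, 25% caption, and 30% OCR could be used to fuse per-modality relevance scores when ranking retrieved documents. The abstract does not say how the weights are applied, so the linear combination, the renormalization over missing modalities, and the names `MODALITY_WEIGHTS`, `fuse_scores`, and `rank` are all assumptions made for illustration, not the authors' implementation.

```python
# Weights as reported in the abstract. How the paper applies them is not
# stated there; a linear combination of per-modality scores is an ASSUMPTION.
MODALITY_WEIGHTS = {"text": 0.30, "image": 0.15, "caption": 0.25, "ocr": 0.30}


def fuse_scores(scores, weights=MODALITY_WEIGHTS):
    """Weighted average of per-modality relevance scores (hypothetical helper).

    `scores` maps a modality name to a relevance score, e.g. a cosine
    similarity in [0, 1]. Modalities missing from `scores` are skipped and
    the remaining weights are renormalized so they still sum to 1.
    """
    present = {m: w for m, w in weights.items() if m in scores}
    total = sum(present.values())
    if total == 0:
        raise ValueError("scores contains no known modality")
    return sum(scores[m] * w for m, w in present.items()) / total


def rank(candidates, weights=MODALITY_WEIGHTS):
    """Sort (doc_id, scores) pairs by fused score, highest first."""
    return sorted(candidates, key=lambda c: fuse_scores(c[1], weights), reverse=True)


if __name__ == "__main__":
    # Made-up scores for two hypothetical documents.
    docs = [
        ("invoice_a", {"text": 0.9, "image": 0.1, "caption": 0.1, "ocr": 0.1}),
        ("invoice_b", {"text": 0.5, "image": 0.5, "caption": 0.5, "ocr": 0.5}),
    ]
    for doc_id, scores in rank(docs):
        print(doc_id, round(fuse_scores(scores), 3))
```

Under these assumptions, a document that matches strongly on text alone can rank below one that matches moderately across all four modalities. Passing only a `text` score reduces the fusion to a text-only baseline, which is the comparison point the abstract uses for the 57.3% figure.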

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Handwritten Text Recognition Techniques