Comparison of Text-Based and Image-Based Retrieval in Multimodal Retrieval Augmented Generation Large Language Model Systems

Elias Lumer; Alex Cardenas; Matt Melich; Myles Mason; Sara Dieter; Vamse Kumar Subbiah; Pradeep Honaganahalli Basavaraju; Roberto Hernandez

arXiv:2511.16654·cs.CL·November 25, 2025

Open Access

TL;DR

This paper compares text-based and image-based retrieval methods in multimodal Retrieval-Augmented Generation systems, showing that direct multimodal embedding retrieval significantly improves accuracy over text summarization approaches and preserves visual context that summarization loses.

Contribution

It provides a comprehensive analysis demonstrating the superiority of direct multimodal embedding retrieval over text-based summarization in multimodal RAG systems.

Findings

01. Direct multimodal retrieval outperforms text-based approaches by 13-32% in key metrics.

02. Direct retrieval yields more accurate and factually consistent answers.

03. Text summarization causes significant information loss during preprocessing.

Abstract

Recent advancements in Retrieval-Augmented Generation (RAG) have enabled Large Language Models (LLMs) to access multimodal knowledge bases containing both text and visual information such as charts, diagrams, and tables in financial documents. However, existing multimodal RAG systems rely on LLM-based summarization to convert images into text during preprocessing and store only the text representations in vector databases, which loses contextual information and visual details critical for downstream retrieval and question answering. To address this limitation, we present a comprehensive comparative analysis of two retrieval approaches for multimodal RAG systems: text-based chunk retrieval (where images are summarized into text before embedding) and direct multimodal embedding retrieval (where images are stored natively in the vector space). We evaluate both approaches…
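The two indexing paths described in the abstract can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: `Entry`, `build_text_index`, `build_multimodal_index`, `retrieve`, and the toy stand-ins `toy_embed` and `toy_summarize` are all hypothetical names, and an "image" is represented by a plain string so the sketch runs without any model. In a real system the stand-ins would be replaced by an LLM summarizer and by text or multimodal embedding models.

```python
import math
from dataclasses import dataclass
from typing import Callable, Dict, List

Vector = List[float]


@dataclass
class Entry:
    doc_id: str
    vector: Vector
    payload: str  # what the generator LLM receives if this entry is retrieved


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def build_text_index(images: Dict[str, str],
                     summarize: Callable[[str], str],
                     embed_text: Callable[[str], Vector]) -> List[Entry]:
    """Text-based path: image -> LLM summary -> text embedding."""
    index = []
    for doc_id, image in images.items():
        summary = summarize(image)
        # Only the summary is stored; the image itself is discarded.
        index.append(Entry(doc_id, embed_text(summary), summary))
    return index


def build_multimodal_index(images: Dict[str, str],
                           embed_image: Callable[[str], Vector]) -> List[Entry]:
    """Direct path: image -> multimodal embedding, image kept as payload."""
    return [Entry(doc_id, embed_image(image), image)
            for doc_id, image in images.items()]


def retrieve(index: List[Entry], query_vector: Vector, k: int = 1) -> List[Entry]:
    """Return the k entries most similar to the query by cosine similarity."""
    ranked = sorted(index, key=lambda e: cosine(query_vector, e.vector),
                    reverse=True)
    return ranked[:k]


# Toy stand-ins (assumptions for illustration only, not real models).
VOCAB = ["revenue", "chart", "headcount", "table"]


def toy_embed(text: str) -> Vector:
    # Bag-of-words counts over a tiny fixed vocabulary.
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def toy_summarize(image: str) -> str:
    # Deliberately lossy: keeps only the first word of the stand-in image.
    return image.split()[0]
```

The structural difference is in the `payload` field: the text-based index can only hand the summary to the generator, while the direct index hands over the original image. The toy stand-ins exist only to make the sketch runnable and say nothing about real retrieval quality.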

Taxonomy

Topics: Multimodal Machine Learning Applications · Topic Modeling · Advanced Graph Neural Networks