Multimodal RAG Enhanced Visual Description

Amit Kumar Jaiswal; Haiming Liu; Ingo Frommholz

arXiv:2508.09170·cs.LG·August 14, 2025


TL;DR

This paper introduces a lightweight, training-free retrieval-augmented generation method that aligns visual and textual modalities in large multimodal models, improving image description quality without extensive fine-tuning.

Contribution

The paper proposes an efficient approach that uses a linear mapping to align the visual and textual modalities in large multimodal models, avoiding costly training and fine-tuning.

Findings

01 Significant improvements on benchmark datasets

02 Effective modality gap bridging without training

03 Enhanced image description generation quality

Abstract

Textual descriptions for multimodal inputs entail recurrent refinement of queries to produce relevant output images. Despite efforts to address challenges such as scaling model size and data volume, the cost associated with pre-training and fine-tuning remains substantial. However, pre-trained large multimodal models (LMMs) encounter a modality gap, characterised by a misalignment between textual and visual representations within a common embedding space. Although fine-tuning can potentially mitigate this gap, it is typically expensive and impractical due to the requirement for extensive domain-driven data. To overcome this challenge, we propose a lightweight training-free approach utilising Retrieval-Augmented Generation (RAG) to extend across the modality using a linear mapping, which can be computed efficiently. During inference, this mapping is applied to images embedded by an LMM…
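
The abstract above is truncated and does not say how the linear mapping is computed, so the following is only an illustrative sketch of one common way to obtain such a mapping: closed-form ridge regression from image embeddings to text embeddings, followed by cosine-similarity retrieval. The function names (`fit_linear_map`, `retrieve`), the ridge penalty, and the synthetic embeddings are all assumptions made for this example and are not taken from the paper.

```python
import numpy as np


def fit_linear_map(image_emb, text_emb, ridge=1e-3):
    """Closed-form ridge regression (an assumed choice, not the paper's stated method).

    Finds W minimising ||image_emb @ W - text_emb||^2 + ridge * ||W||^2.
    """
    d = image_emb.shape[1]
    gram = image_emb.T @ image_emb + ridge * np.eye(d)
    return np.linalg.solve(gram, image_emb.T @ text_emb)


def l2_normalise(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def retrieve(query_image_emb, W, corpus_text_emb, k=3):
    """Map one image embedding into text space and return the top-k text indices."""
    mapped = l2_normalise(query_image_emb @ W)
    scores = l2_normalise(corpus_text_emb) @ mapped  # cosine similarities
    return np.argsort(-scores)[:k]


# Synthetic stand-ins for paired embeddings (hypothetical data, 8 dimensions).
rng = np.random.default_rng(0)
true_map = rng.normal(size=(8, 8))
image_emb = rng.normal(size=(200, 8))
text_emb = image_emb @ true_map + 0.01 * rng.normal(size=(200, 8))

W = fit_linear_map(image_emb, text_emb)
top_k = retrieve(image_emb[5], W, text_emb, k=3)
print(top_k)
```

No gradient-based training is involved: the mapping comes from a single linear solve over paired embeddings. In a RAG setup, the retrieved texts would typically be passed to the generator as additional context; how the paper does this is not visible in the excerpt.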
