Multimodal RAG Enhanced Visual Description
Amit Kumar Jaiswal, Haiming Liu, Ingo Frommholz

TL;DR
This paper introduces a lightweight, training-free retrieval-augmented generation method that aligns visual and textual modalities in large multimodal models, improving image description quality without extensive fine-tuning.
Contribution
It proposes a novel, efficient approach using a linear mapping for modality alignment in large multimodal models, avoiding costly training and fine-tuning.
Findings
Significant improvements on benchmark datasets
Effective modality gap bridging without training
Enhanced image description generation quality
Abstract
Textual descriptions for multimodal inputs entail recurrent refinement of queries to produce relevant output images. Despite efforts to address challenges such as scaling model size and data volume, the cost associated with pre-training and fine-tuning remains substantial. However, pre-trained large multimodal models (LMMs) encounter a modality gap, characterised by a misalignment between textual and visual representations within a common embedding space. Although fine-tuning can potentially mitigate this gap, it is typically expensive and impractical due to the requirement for extensive domain-driven data. To overcome this challenge, we propose a lightweight training-free approach utilising Retrieval-Augmented Generation (RAG) to extend across the modality using a linear mapping, which can be computed efficiently. During inference, this mapping is applied to images embedded by an LMM…
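The abstract above is cut off before it says how the linear mapping is computed, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes the mapping is fit by closed-form ridge regression on paired image and text embeddings, and that retrieval ranks captions by cosine similarity. The names `fit_linear_map`, `retrieve`, and `ridge` are hypothetical, and the embeddings are synthetic stand-ins for the output of a real LMM encoder.

```python
import numpy as np


def fit_linear_map(image_emb, text_emb, ridge=1e-3):
    """Closed-form ridge regression: W minimising ||X W - Y||^2 + ridge * ||W||^2.

    ASSUMPTION: the source only says the mapping is linear and cheap to compute;
    ridge least squares is one plausible choice, not the confirmed one.
    """
    dim = image_emb.shape[1]
    gram = image_emb.T @ image_emb + ridge * np.eye(dim)
    return np.linalg.solve(gram, image_emb.T @ text_emb)


def l2_normalize(x):
    """Scale vectors to unit length along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def retrieve(query_image_emb, W, caption_emb, k=3):
    """Map one image embedding into text space and return the indices of the top-k captions."""
    mapped = l2_normalize(query_image_emb @ W)
    scores = l2_normalize(caption_emb) @ mapped  # cosine similarity per caption
    return np.argsort(-scores)[:k]


# Synthetic paired data (hypothetical): 200 image/caption pairs, 16-d image space, 12-d text space.
rng = np.random.default_rng(0)
true_map = rng.normal(size=(16, 12))
image_emb = rng.normal(size=(200, 16))
text_emb = image_emb @ true_map + 0.01 * rng.normal(size=(200, 12))

# No gradient-based training: the mapping comes from a single linear solve.
W = fit_linear_map(image_emb, text_emb)

# At inference, the mapped image embedding is used to look up the closest captions,
# which a RAG pipeline could then pass to the language model as context.
top_captions = retrieve(image_emb[5], W, text_emb, k=3)
```

In a real pipeline the retrieved captions would be inserted into the generation prompt; that step is omitted here because the truncated abstract does not describe it.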
