Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs
Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, Rita Cucchiara

TL;DR
Wiki-LLaVA enhances multimodal large language models by integrating a hierarchical retrieval system to access external multimodal knowledge, significantly improving their ability to answer knowledge-dependent visual questions.
Contribution
The paper introduces Wiki-LLaVA, a novel hierarchical retrieval-augmented generation framework for multimodal LLMs, enabling effective external knowledge integration for visual question answering.
Findings
Improved accuracy in knowledge-dependent visual question answering
Effective retrieval of relevant multimodal passages from external sources
Enhanced dialogue precision with external knowledge augmentation
Abstract
Multimodal LLMs are the natural evolution of LLMs, and enlarge their capabilities so as to work beyond the pure textual modality. As research is being carried out to design novel architectures and vision-and-language adapters, in this paper we concentrate on endowing such models with the capability of answering questions that require external knowledge. Our approach, termed Wiki-LLaVA, aims at integrating an external knowledge source of multimodal documents, which is accessed through a hierarchical retrieval pipeline. Relevant passages, using this approach, are retrieved from the external knowledge source and employed as additional context for the LLM, augmenting the effectiveness and precision of generated dialogues. We conduct extensive experiments on datasets tailored for visual question answering with external data and demonstrate the appropriateness of our approach.
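Sketch
The abstract describes retrieval as hierarchical but does not spell out the stages on this page. The following is a minimal sketch of one plausible reading, assuming a two-level scheme: the query image first selects the most similar documents, then the question selects the most similar passages within those documents, and the selected passages are prepended to the prompt. All names (`hierarchical_retrieve`, `build_prompt`, `KB`), the 2-dimensional toy vectors, and the prompt template are hypothetical and chosen for illustration; a real system would use learned image and text encoders rather than hand-written vectors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, items, k):
    """Return the k items whose 'vec' field is most similar to query_vec."""
    return sorted(items, key=lambda it: cosine(query_vec, it["vec"]), reverse=True)[:k]


def hierarchical_retrieve(image_vec, question_vec, documents, k_docs=1, k_passages=2):
    """Assumed two-level retrieval: documents by image, then passages by question."""
    # Level 1: narrow the knowledge base to the documents closest to the image.
    docs = top_k(image_vec, documents, k_docs)
    # Level 2: rank only the passages that belong to those documents.
    candidates = [p for d in docs for p in d["passages"]]
    return [p["text"] for p in top_k(question_vec, candidates, k_passages)]


def build_prompt(question, passages):
    """Hypothetical prompt template that places retrieved passages before the question."""
    context = "\n".join(f"- {p}" for p in passages)
    return f"Context:\n{context}\nQuestion: {question}\nAnswer:"


# Toy knowledge base with made-up embeddings (illustration only).
KB = [
    {"title": "Eiffel Tower", "vec": [1.0, 0.0], "passages": [
        {"text": "The Eiffel Tower was completed in 1889.", "vec": [1.0, 0.1]},
        {"text": "The Eiffel Tower is located in Paris.", "vec": [0.1, 1.0]},
    ]},
    {"title": "Colosseum", "vec": [0.0, 1.0], "passages": [
        {"text": "The Colosseum is in Rome.", "vec": [0.1, 1.0]},
    ]},
]

passages = hierarchical_retrieve([0.9, 0.1], [1.0, 0.0], KB, k_docs=1, k_passages=1)
prompt = build_prompt("When was this tower completed?", passages)
print(prompt)
```

Restricting the passage search to the top-ranked documents keeps the second stage small, and it prevents a passage from an unrelated document (here, the Colosseum entry) from being selected just because its text resembles the question.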
Taxonomy
Topics: Natural Language Processing Techniques · Semantic Web and Ontologies · Topic Modeling
