Generative Multi-Modal Knowledge Retrieval with Large Language Models
Xinwei Long, Jiali Zeng, Fandong Meng, Zhiyuan Ma, Kaiyan Zhang, Bowen Zhou, Jie Zhou

TL;DR
This paper introduces an end-to-end generative framework leveraging large language models for multi-modal knowledge retrieval, improving effectiveness and training efficiency in handling multi-modal queries.
Contribution
It proposes a novel approach combining object-aware prefix-tuning and knowledge-guided generation to enhance multi-modal knowledge retrieval with LLMs.
Findings
Achieved 3.0% to 14.6% improvements on three benchmarks.
Effectively aligns multi-grained visual features with the textual space.
Demonstrates the effectiveness of the proposed framework over strong baselines.
Abstract
Knowledge retrieval with multi-modal queries plays a crucial role in supporting knowledge-intensive multi-modal applications. However, existing methods face challenges in terms of their effectiveness and training efficiency, especially when it comes to training and integrating multiple retrievers to handle multi-modal queries. In this paper, we propose an innovative end-to-end generative framework for multi-modal knowledge retrieval. Our framework takes advantage of the fact that large language models (LLMs) can effectively serve as virtual knowledge bases, even when trained with limited data. We retrieve knowledge via a two-step process: 1) generating knowledge clues related to the queries, and 2) obtaining the relevant document by searching databases using the knowledge clue. In particular, we first introduce an object-aware prefix-tuning technique to guide multi-grained visual…
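The two-step process described in the abstract (generate a knowledge clue, then search a database with it) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `build_clue_index`, `search_by_clue`, `retrieve`, and `fake_generator` are hypothetical, the LLM is replaced by a stub that returns a fixed string, the visual input is omitted entirely, and the lookup is a plain case-insensitive substring match over an invented toy corpus.

```python
from typing import Callable, Dict, List


def build_clue_index(documents: Dict[str, str]) -> Dict[str, str]:
    """Normalize each document once (lowercase, single spaces) for clue lookups."""
    return {doc_id: " ".join(text.lower().split()) for doc_id, text in documents.items()}


def search_by_clue(clue: str, index: Dict[str, str]) -> List[str]:
    """Step 2: return ids of documents whose normalized text contains the clue."""
    needle = " ".join(clue.lower().split())
    if not needle:
        return []
    return [doc_id for doc_id, text in index.items() if needle in text]


def retrieve(query: str, generate_clue: Callable[[str], str], index: Dict[str, str]) -> List[str]:
    """Step 1 then step 2: generate a clue for the query, then search with it."""
    clue = generate_clue(query)
    return search_by_clue(clue, index)


# Toy corpus, invented purely for illustration.
DOCS = {
    "d1": "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    "d2": "The Golden Gate Bridge spans the strait between San Francisco Bay and the Pacific Ocean.",
}


def fake_generator(query: str) -> str:
    # Hypothetical stand-in for the LLM; a real system would also condition on the image.
    return "wrought-iron lattice tower"


INDEX = build_clue_index(DOCS)
RESULT = retrieve("What is this landmark made of?", fake_generator, INDEX)
print(RESULT)
```

Passing the generator in as a callable keeps the lookup step independent of whichever model produces the clue, so the stub can be swapped out without touching the search code.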
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Natural Language Processing Techniques
Methods: ALIGN
