Generative Cross-Modal Retrieval: Memorizing Images in Multimodal Language Models for Retrieval and Beyond
Yongqi Li, Wenjie Wang, Leigang Qu, Liqiang Nie, Wenjie Li, Tat-Seng Chua

TL;DR
This paper introduces a novel generative cross-modal retrieval framework that enables multimodal large language models to memorize and recall images through a two-step training process, enhancing retrieval capabilities beyond traditional methods.
Contribution
The paper presents a new generative paradigm for cross-modal retrieval that allows MLLMs to memorize images and retrieve them via textual queries, differing from previous discriminative approaches.
Findings
Effective image memorization and retrieval demonstrated
Performs well with large-scale image candidate sets
Offers an efficient alternative to discriminative methods
Abstract
The recent advancements in generative language models have demonstrated their ability to memorize knowledge from documents and recall knowledge to respond to user queries effectively. Building upon this capability, we propose to enable multimodal large language models (MLLMs) to memorize and recall images within their parameters. Given a user query for visual content, the MLLM is anticipated to "recall" the relevant image from its parameters as the response. Achieving this target presents notable challenges, including inbuilt visual memory and visual recall schemes within MLLMs. To address these challenges, we introduce a generative cross-modal retrieval framework, which assigns unique identifier strings to represent images and involves two training steps: learning to memorize and learning to retrieve. The first step focuses on training the MLLM to memorize the association between…
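The identifier-based setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the zero-padded index identifiers, the `TrainingExample` record, and the helper names are all hypothetical, and the pairing used for each step (image to identifier for memorizing, text query to identifier for retrieving) is an assumption, since the abstract is truncated here. No model is trained; the code only shows how unique identifier strings could tie the two training steps together.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingExample:
    step: str       # "memorize" or "retrieve"
    source: str     # image path (memorize) or text query (retrieve)
    target_id: str  # identifier string the model would be trained to generate


def assign_identifiers(image_paths):
    """Give each distinct image a unique identifier string.

    Hypothetical scheme: zero-padded index over the sorted, de-duplicated paths.
    """
    unique = sorted(set(image_paths))
    width = len(str(max(len(unique) - 1, 0)))
    return {path: str(i).zfill(width) for i, path in enumerate(unique)}


def build_examples(image_paths, query_to_image):
    """Build example pairs for both steps (assumed formats, see lead-in)."""
    ids = assign_identifiers(image_paths)
    # Step 1 (learning to memorize): image -> its identifier.
    memorize = [TrainingExample("memorize", p, ids[p]) for p in sorted(ids)]
    # Step 2 (learning to retrieve): text query -> identifier of the target image.
    retrieve = [
        TrainingExample("retrieve", query, ids[image])
        for query, image in query_to_image.items()
    ]
    return ids, memorize, retrieve


def resolve(generated_id, ids):
    """Map a generated identifier back to an image path, or None if unknown."""
    inverse = {identifier: path for path, identifier in ids.items()}
    return inverse.get(generated_id)
```

At inference time, under these assumptions, the string generated for a query would be passed to `resolve` to look up the image; an identifier that matches no image returns `None`, which is one simple way to handle invalid generations.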
Taxonomy
Topics: Multimodal Machine Learning Applications
