Generative Cross-Modal Retrieval: Memorizing Images in Multimodal Language Models for Retrieval and Beyond

Yongqi Li; Wenjie Wang; Leigang Qu; Liqiang Nie; Wenjie Li; Tat-Seng Chua

arXiv:2402.10805 · cs.MM · February 19, 2024 · 1 citation

TL;DR

This paper introduces a generative cross-modal retrieval framework in which a multimodal large language model memorizes images in its parameters and recalls them in response to textual queries. Each image is represented by a unique identifier string, and training proceeds in two steps: learning to memorize and learning to retrieve.

Contribution

The paper presents a new generative paradigm for cross-modal retrieval that allows MLLMs to memorize images and retrieve them via textual queries, differing from previous discriminative approaches.

Findings

01. Effective image memorization and retrieval demonstrated

02. Performs well with large-scale image candidate sets

03. Offers an efficient alternative to discriminative methods

Abstract

The recent advancements in generative language models have demonstrated their ability to memorize knowledge from documents and recall knowledge to respond to user queries effectively. Building upon this capability, we propose to enable multimodal large language models (MLLMs) to memorize and recall images within their parameters. Given a user query for visual content, the MLLM is anticipated to "recall" the relevant image from its parameters as the response. Achieving this target presents notable challenges, including inbuilt visual memory and visual recall schemes within MLLMs. To address these challenges, we introduce a generative cross-modal retrieval framework, which assigns unique identifier strings to represent images and involves two training steps: learning to memorize and learning to retrieve. The first step focuses on training the MLLM to memorize the association between…
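The identifier idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the zero-padded numeric identifier scheme and the helper names `assign_identifiers` and `resolve` are assumptions made here for illustration, and the MLLM itself is left out. The sketch only shows how a string produced by a model could be mapped back to a stored image.

```python
from typing import Dict, List, Optional


def assign_identifiers(image_paths: List[str], width: int = 6) -> Dict[str, str]:
    """Map each distinct image to a unique identifier string.

    Assumption: identifiers are zero-padded decimal strings. The paper's
    actual identifier scheme is not given in the text above.
    """
    unique_paths = list(dict.fromkeys(image_paths))  # drop duplicates, keep order
    if len(unique_paths) > 10 ** width:
        raise ValueError("width is too small for the number of images")
    return {str(i).zfill(width): path for i, path in enumerate(unique_paths)}


def resolve(generated: str, id_to_image: Dict[str, str]) -> Optional[str]:
    """Return the image for a generated identifier, or None if it matches no image."""
    return id_to_image.get(generated.strip())
```

In use, `assign_identifiers` would run once over the image collection, and `resolve` would be applied to whatever identifier string the model generates for a query. A string that matches no identifier resolves to `None`, so the caller has to handle that case explicitly.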

Taxonomy

Topics: Multimodal Machine Learning Applications