DRUM: Learning Demonstration Retriever for Large MUlti-modal Models

Ellen Yi-Ge; Jiechao Gao; Wei Han; Wei Zhu

arXiv:2412.07619·cs.CL·December 11, 2024

TL;DR

DRUM is a novel framework that fine-tunes a visual-language embedding model to improve demonstration retrieval, significantly enhancing large vision-language models' in-context learning across multiple tasks and datasets.

Contribution

It introduces a demonstration retriever that is fine-tuned with feedback and re-ranking strategies, optimizing demonstration selection for LVLMs.

Findings

01. Improves in-context learning performance on visual-language tasks

02. Effective across multiple datasets and task types

03. Enhances demonstration relevance through iterative mining

Abstract

Recently, large language models (LLMs) have demonstrated impressive capabilities in handling new tasks with the help of in-context learning (ICL). In the study of Large Vision-Language Models (LVLMs), researchers implementing ICL usually adopt naive strategies such as using fixed demonstrations across different samples, or selecting demonstrations directly via a visual-language embedding model. These methods do not guarantee that the configured demonstrations fit the needs of the LVLM. To address this issue, we propose a novel framework, Demonstration Retriever for large mUlti-modal Model (DRUM), which fine-tunes the visual-language embedding model to better meet the LVLM's needs. First, we discuss the retrieval strategies for a visual-language task, assuming an embedding model is given. We also propose to concatenate the image and text…
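The embedding-based demonstration selection that the abstract describes as the baseline can be illustrated with a short sketch. It assumes, since the abstract is truncated here, that "concatenating the image and text" means concatenating per-modality embedding vectors; the function names (`joint_embedding`, `retrieve_demonstrations`), the L2 normalization step, and the toy vectors are hypothetical choices for illustration and are not taken from the paper. A real setup would obtain the vectors from a visual-language embedding model.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return list(vec)
    return [x / norm for x in vec]


def joint_embedding(image_emb: Sequence[float], text_emb: Sequence[float]) -> List[float]:
    """Assumed design: normalize each modality, then concatenate the two vectors."""
    return l2_normalize(image_emb) + l2_normalize(text_emb)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve_demonstrations(query: Sequence[float],
                            pool: Sequence[Sequence[float]],
                            k: int) -> List[int]:
    """Return indices of the k pool entries most similar to the query."""
    scored = [(cosine(query, emb), idx) for idx, emb in enumerate(pool)]
    # Sort by descending similarity; break ties by lower index.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy 2-d "image" and "text" embeddings standing in for real model outputs.
pool = [
    joint_embedding([1.0, 0.0], [1.0, 0.0]),  # matches the query in both modalities
    joint_embedding([0.0, 1.0], [0.0, 1.0]),  # matches in neither
    joint_embedding([1.0, 0.0], [0.0, 1.0]),  # matches in image only
]
query = joint_embedding([1.0, 0.1], [0.9, 0.2])
print(retrieve_demonstrations(query, pool, k=2))  # prints [0, 2]
```

Normalizing each modality before concatenation keeps either one from dominating the similarity score through scale alone; whether DRUM does this is not stated in the text above.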

Taxonomy

Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications