DRUM: Learning Demonstration Retriever for Large MUlti-modal Models
Ellen Yi-Ge, Jiechao Gao, Wei Han, Wei Zhu

TL;DR
DRUM is a novel framework that fine-tunes a visual-language embedding model to improve demonstration retrieval, significantly enhancing large vision-language models' in-context learning across multiple tasks and datasets.
Contribution
It introduces a demonstration retriever that is fine-tuned with feedback and re-ranking strategies, optimizing demonstration selection for LVLMs.
Findings
Improves in-context learning performance on visual-language tasks
Effective across multiple datasets and task types
Enhances demonstration relevance through iterative mining
Abstract
Recently, large language models (LLMs) have demonstrated impressive capabilities in handling new tasks with the help of in-context learning (ICL). In the study of large vision-language models (LVLMs), researchers implementing ICL usually adopt naive strategies, such as using fixed demonstrations across different samples or selecting demonstrations directly via a visual-language embedding model. These methods do not guarantee that the configured demonstrations fit the needs of the LVLM. To address this issue, we propose a novel framework, Demonstration Retriever for large mUlti-modal Model (DRUM), which fine-tunes the visual-language embedding model to better meet the LVLM's needs. First, we discuss the retrieval strategies for a visual-language task, assuming an embedding model is given, and we propose to concatenate the image and text…
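The embedding-based selection mentioned in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract is truncated before it specifies what is concatenated, so the sketch assumes that each example's image and text embeddings are L2-normalized and joined into one vector, and that demonstrations are ranked by cosine similarity to the query. The function names `joint_embedding` and `retrieve_demonstrations` are hypothetical, and the vectors stand in for the output of whatever visual-language embedding model is used.

```python
import math
from typing import List, Sequence


def l2_normalize(vec: Sequence[float]) -> List[float]:
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        return [0.0] * len(vec)
    return [x / norm for x in vec]


def joint_embedding(image_vec: Sequence[float], text_vec: Sequence[float]) -> List[float]:
    """Assumption: represent an example by concatenating its normalized
    image and text embeddings, so both modalities weigh equally."""
    return l2_normalize(image_vec) + l2_normalize(text_vec)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors of equal length."""
    a_n, b_n = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_n, b_n))


def retrieve_demonstrations(
    query: Sequence[float], pool: Sequence[Sequence[float]], k: int
) -> List[int]:
    """Return indices of the k pool examples most similar to the query.
    Ties are broken by the lower index."""
    scored = [(cosine(query, emb), idx) for idx, emb in enumerate(pool)]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


# Toy 2-d embeddings standing in for real model outputs.
query = joint_embedding([1.0, 0.0], [0.0, 1.0])
pool = [
    joint_embedding([1.0, 0.0], [0.0, 1.0]),  # matches image and text
    joint_embedding([0.0, 1.0], [1.0, 0.0]),  # matches neither
    joint_embedding([1.0, 0.0], [1.0, 0.0]),  # matches image only
]
top2 = retrieve_demonstrations(query, pool, k=2)
```

In this toy pool the example that matches both modalities ranks first and the image-only match second. A retriever of this kind ranks purely by embedding similarity, which is the limitation the abstract points to: nothing in the score reflects whether the selected demonstrations actually help the LVLM.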
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Multimodal Machine Learning Applications
