Visual In-Context Learning for Large Vision-Language Models
Yucheng Zhou, Xiang Li, Qianning Wang, Jianbing Shen

TL;DR
This paper introduces Visual In-Context Learning (VICL) for large vision-language models. VICL combines demonstration retrieval, intent-oriented image summarization, and demonstration composition to ease cross-modal interaction and improve reasoning, and is evaluated on five visual reasoning datasets.
Contribution
The paper proposes VICL, a comprehensive method combining retrieval, summarization, and demonstration composition to enhance in-context learning in large vision-language models.
Findings
VICL improves visual reasoning performance across five datasets.
The method reduces token count and alleviates cross-modal interaction issues.
In-context unlearning effectively resets model knowledge without retraining.
Abstract
In Large Visual Language Models (LVLMs), the efficacy of In-Context Learning (ICL) remains limited by challenges in cross-modal interactions and representation disparities. To overcome these challenges, we introduce a novel Visual In-Context Learning (VICL) method comprising Visual Demonstration Retrieval, Intent-Oriented Image Summarization, and Intent-Oriented Demonstration Composition. Our approach retrieves images via a "Retrieval & Rerank" paradigm, summarises images with task intent and task-specific visual parsing, and composes language-based demonstrations that reduce token count and alleviate the cross-modal interaction problem. Experimental evaluations on five visual reasoning datasets demonstrate the effectiveness of our method. Moreover, our extensive experiments leverage information flow analysis to elucidate the effectiveness of our method, and investigate the impact of length…
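The three stages named in the abstract (retrieve, then rerank, then compose language-only demonstrations) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Demo` class, the toy two-dimensional "image embeddings", the word-overlap reranker, and the prompt template are all assumptions made for illustration, and in practice an image encoder and an LVLM-generated summary would stand in for them.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Demo:
    image_vec: list   # hypothetical stand-in for an image embedding
    summary: str      # hypothetical stand-in for an intent-oriented image summary
    label: str        # the demonstration's answer


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sqrt(sum(x * x for x in a))
    nb = sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_vec, pool, k):
    """Stage 1 (assumed): keep the k demos whose image vectors are closest to the query."""
    return sorted(pool, key=lambda d: cosine(query_vec, d.image_vec), reverse=True)[:k]


def rerank(query_summary, candidates, n):
    """Stage 2 (assumed): reorder candidates by Jaccard word overlap between summaries."""
    q = set(query_summary.lower().split())

    def overlap(d):
        w = set(d.summary.lower().split())
        union = q | w
        return len(q & w) / len(union) if union else 0.0

    return sorted(candidates, key=overlap, reverse=True)[:n]


def compose_prompt(demos, query_summary, question):
    """Stage 3 (assumed template): text-only demonstrations followed by the open query."""
    blocks = [
        f"Image summary: {d.summary}\nQuestion: {question}\nAnswer: {d.label}"
        for d in demos
    ]
    blocks.append(f"Image summary: {query_summary}\nQuestion: {question}\nAnswer:")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    pool = [
        Demo([1.0, 0.0], "a red apple on a table", "apple"),
        Demo([0.9, 0.1], "a green apple in a bowl", "apple"),
        Demo([0.0, 1.0], "a dog running in a park", "dog"),
    ]
    candidates = retrieve([1.0, 0.2], pool, k=2)
    chosen = rerank("a red apple on a plate", candidates, n=1)
    print(compose_prompt(chosen, "a red apple on a plate", "What is the main object?"))
```

Because each demonstration enters the prompt as a short text summary rather than as an image, the prompt grows by a few dozen tokens per demonstration, which is the token-count reduction the abstract refers to.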
Taxonomy
Topics: Multimodal Machine Learning Applications · Advanced Image and Video Retrieval Techniques · Domain Adaptation and Few-Shot Learning
