Provoking Multi-modal Few-Shot LVLM via Exploration-Exploitation In-Context Learning

Cheng Chen; Yunpeng Zhai; Yifan Zhao; Jinyang Gao; Bolin Ding; Jia Li

arXiv:2506.09473 · cs.CV · June 12, 2025

TL;DR

This paper introduces a reinforcement learning framework for selecting multi-modal demonstrations for large vision-language models. By optimizing the demonstration-selection policy, it improves few-shot performance over existing methods on four VQA datasets.

Contribution

It proposes a novel exploration-exploitation reinforcement learning approach that adaptively fuses multi-modal information and selects demonstrations, surpassing heuristic methods.

Findings

01 Outperforms existing methods on four VQA datasets
02 Enhances generalization of few-shot LVLMs
03 Demonstrates effective autonomous policy refinement

Abstract

In-context learning (ICL), a predominant trend in instruction learning, aims to enhance the performance of large language models by providing clear task guidance and examples, improving their task understanding and execution. This paper investigates ICL on Large Vision-Language Models (LVLMs) and explores policies for multi-modal demonstration selection. Existing ICL research faces two significant challenges. First, it relies on pre-defined demonstrations or heuristic selection strategies based on human intuition, which are usually inadequate for covering diverse task requirements and lead to sub-optimal solutions. Second, selecting each demonstration individually fails to model the interactions between demonstrations, resulting in information redundancy. Unlike these prevailing efforts, we propose a new exploration-exploitation reinforcement learning framework, which…
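
The contrast drawn in the abstract can be illustrated with a small sketch. This is not the paper's method, which the truncated abstract does not describe. It is a toy, assuming hand-made embedding vectors, a hypothetical `toy_reward` function, and a plain epsilon-greedy bandit over whole demonstration sets. The names `top_k_by_similarity`, `epsilon_greedy_sets`, and `toy_reward` are all invented for illustration. The first function scores each candidate on its own, as a similarity heuristic would; the second rewards a set as a unit, so redundancy between demonstrations can lower the score.

```python
import itertools
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k_by_similarity(query, pool, k):
    """Heuristic baseline: score each candidate independently, keep the k best."""
    ranked = sorted(range(len(pool)), key=lambda i: cosine(query, pool[i]), reverse=True)
    return tuple(sorted(ranked[:k]))


def epsilon_greedy_sets(pool_size, k, reward_fn, steps=200, epsilon=0.2, seed=0):
    """Treat every k-subset of the pool as a bandit arm and learn its mean reward.

    With probability epsilon a random subset is tried (exploration);
    otherwise the best subset seen so far is reused (exploitation).
    """
    rng = random.Random(seed)
    arms = list(itertools.combinations(range(pool_size), k))
    counts = {arm: 0 for arm in arms}
    means = {arm: 0.0 for arm in arms}
    for _ in range(steps):
        tried = [a for a in arms if counts[a] > 0]
        if not tried or rng.random() < epsilon:
            arm = rng.choice(arms)
        else:
            arm = max(tried, key=lambda a: means[a])
        reward = reward_fn(arm)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
    tried = [a for a in arms if counts[a] > 0]
    return max(tried, key=lambda a: means[a])


# Toy data (assumed): candidates 0 and 1 are near-duplicates, as are 2 and 3.
pool = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
query = [1.0, 0.2]


def toy_reward(arm):
    """Hypothetical set-level reward: relevance to the query minus pairwise redundancy."""
    relevance = sum(cosine(query, pool[i]) for i in arm)
    redundancy = sum(cosine(pool[i], pool[j]) for i, j in itertools.combinations(arm, 2))
    return relevance - redundancy


baseline = top_k_by_similarity(query, pool, 2)  # picks the near-duplicates 0 and 1
learned = epsilon_greedy_sets(len(pool), 2, toy_reward)
print("baseline:", baseline, "reward:", round(toy_reward(baseline), 3))
print("learned: ", learned, "reward:", round(toy_reward(learned), 3))
```

Enumerating every subset is only feasible for a tiny pool. A practical system would need a learned policy that builds the set step by step, which is presumably what a reinforcement learning framework provides, though the details are not available from this page.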

Taxonomy

Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling