Learning to Select Visual In-Context Demonstrations
Eugene Lee, Yu-Chi Lin, Jiajie Diao

TL;DR
This paper introduces LSD, a reinforcement learning approach for selecting demonstrations in multimodal large language models, improving performance on factual regression tasks by balancing relevance and diversity.
Contribution
It reframes demonstration selection as a sequential decision problem and trains an RL agent to optimize demonstration sets for better visual ICL performance.
Findings
LSD outperforms kNN on objective factual regression benchmarks.
kNN remains effective for subjective preference tasks.
LSD balances relevance and diversity to define regression boundaries.
Abstract
Multimodal Large Language Models (MLLMs) adapt to visual tasks via in-context learning (ICL), which relies heavily on demonstration quality. The dominant demonstration selection strategy is unsupervised k-Nearest Neighbor (kNN) search. While simple, this similarity-first approach is sub-optimal for complex factual regression tasks; it selects redundant examples that fail to capture the task's full output range. We reframe selection as a sequential decision-making problem and introduce Learning to Select Demonstrations (LSD), training a Reinforcement Learning agent to construct optimal demonstration sets. Using a Dueling DQN with a query-centric Transformer Decoder, our agent learns a policy that maximizes MLLM downstream performance. Evaluating across five visual regression benchmarks, we uncover a crucial dichotomy: while kNN remains optimal for subjective preference tasks, LSD…
Peer Reviews
No public reviews on file for this paper yet. If you reviewed it on a platform where reviews are public (OpenReview, ICLR, NeurIPS, ICML), you can paste yours below so the community can read it here.
Videos
No videos yet. Explain this paper in a talk, walkthrough, or lecture? Add one.
