Mimicking or Reasoning: Rethinking Multi-Modal In-Context Learning in Vision-Language Models
Chengyue Huang, Yuchen Zhu, Sichen Zhu, Jingyun Xiao, Moises Andrade, Shivang Chopra, Zsolt Kira

TL;DR
This paper critically examines multi-modal in-context learning in vision-language models, revealing that current models often rely on copying answers rather than genuine reasoning, especially under distribution shifts, and proposes a new pipeline to improve reasoning capabilities.
Contribution
The paper introduces a new MM-ICL with Reasoning pipeline that incorporates generated rationales, and provides extensive experiments showing current models' limited ability to utilize demonstrations effectively.
Findings
Models often rely on copying answers rather than reasoning.
Under distribution shift, performance often degrades as more demonstrations are added.
Current VLMs show limited sensitivity to factors like shot count and rationale quality.
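The copying behavior listed above can be made concrete with a simple diagnostic. The sketch below is illustrative only: `copy_rate`, `normalize`, and the exact-match criterion are assumptions made here, not a metric taken from the paper. It counts a prediction as "copied" when it matches one of the demonstration answers shown in the prompt but not the gold answer.

```python
from typing import List, Sequence


def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences are ignored."""
    return " ".join(answer.lower().strip().split())


def copy_rate(
    predictions: Sequence[str],
    demo_answers: Sequence[List[str]],
    gold_answers: Sequence[str],
) -> float:
    """Hypothetical diagnostic (not the paper's metric).

    Returns the fraction of queries whose prediction equals some
    demonstration answer while differing from the gold answer.
    """
    if not predictions:
        return 0.0
    copied = 0
    for pred, demos, gold in zip(predictions, demo_answers, gold_answers):
        p = normalize(pred)
        matches_demo = p in {normalize(d) for d in demos}
        if matches_demo and p != normalize(gold):
            copied += 1
    return copied / len(predictions)


if __name__ == "__main__":
    preds = ["Cat", "dog", "blue"]
    demos = [["cat", "bird"], ["fish"], ["blue", "red"]]
    golds = ["dog", "dog", "blue"]
    print(copy_rate(preds, demos, golds))
```

Excluding predictions that also match the gold answer keeps correct answers that happen to coincide with a demonstration from being counted as copies. A stricter or looser criterion (for example, substring matching) would be an equally reasonable choice.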
Abstract
Vision-language models (VLMs) are widely assumed to exhibit in-context learning (ICL), a property similar to that of their language-only counterparts. While recent work suggests VLMs can perform multimodal ICL (MM-ICL), studies show they often rely on shallow heuristics -- such as copying or majority voting -- rather than true task understanding. We revisit this assumption by evaluating VLMs under distribution shifts, where support examples come from a dataset different from the query. Surprisingly, performance often degrades with more demonstrations, and models tend to copy answers rather than learn from them. To investigate further, we propose a new MM-ICL with Reasoning pipeline that augments each demonstration with a generated rationale alongside the answer. We conduct extensive and comprehensive experiments on both perception- and reasoning-required datasets with open-source VLMs…
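As a rough illustration of what augmenting each demonstration with a rationale alongside the answer could look like, here is a minimal prompt-assembly sketch. Everything in it is an assumption for illustration: the `Demo` class, the `build_prompt` function, the `<image:...>` placeholder syntax, and the "Question / Rationale / Answer" template are hypothetical and are not taken from the paper or from any specific VLM's chat format.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Demo:
    """One support example; field names are hypothetical."""
    image_ref: str
    question: str
    answer: str
    rationale: Optional[str] = None


def build_prompt(
    demos: List[Demo],
    query_image_ref: str,
    query_question: str,
    with_rationale: bool = True,
) -> str:
    """Assemble a few-shot prompt, optionally placing a rationale before each answer."""
    parts: List[str] = []
    for d in demos:
        parts.append(f"<image:{d.image_ref}>")  # placeholder token, assumed format
        parts.append(f"Question: {d.question}")
        if with_rationale and d.rationale:
            parts.append(f"Rationale: {d.rationale}")
        parts.append(f"Answer: {d.answer}")
        parts.append("")  # blank line between demonstrations
    parts.append(f"<image:{query_image_ref}>")
    parts.append(f"Question: {query_question}")
    # Leave the last field open so the model continues from it.
    parts.append("Rationale:" if with_rationale else "Answer:")
    return "\n".join(parts)


if __name__ == "__main__":
    demo = Demo("demo_0.png", "What animal is shown?", "cat",
                rationale="The image shows whiskers and pointed ears.")
    print(build_prompt([demo], "query.png", "What color is the car?"))
```

Toggling `with_rationale` gives a matched pair of prompts (answer-only versus rationale-augmented) from the same demonstrations, which is the kind of controlled comparison the abstract describes.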
Peer Reviews
Decision · ICLR 2026 Conference Withdrawn Submission
This paper studies an important direction for VLMs, and presents an analysis of reasoning VLMs. The analysis and findings from the extensive experiments with various VLMs are likely to be of interest to the community.
The findings about performance degradation and pattern copying are not surprising for VLMs: previous works have already pointed out this issue and published benchmarks designed to avoid pattern copying (VLICL [1], TrueMICL [2]), which could be better discussed in the paper. Besides, the motivation of the whole analysis is unclear. This paper classifies the MMICL tasks into two categories (Case I: Well Defined w/o Demos, such as OKVQA and Case II: Ill defined w/o demo
- The idea has a good motivation
- Well-written with good task categorization
- Core finding (VLMs use shallow heuristics, not true ICL) already documented in:
+ Folco Bertini Baldassini, Mustafa Shukor, Matthieu Cord, Laure Soulier, and Benjamin Piwowarski. "What makes multimodal in-context learning work?" In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1539–1550, 2024.
+ Libo Qin, Qiguang Chen, Hao Fei, Zhi Chen, Min Li, and Wanxiang Che. "What factors affect multi-modal in-context learning? an in-depth exploration." arXiv pre
* Controlled studies are conducted with varying shot count, retrieval method, rationale quality, and distribution.
* The paper in general is relatively easy to read.
* Extending support examples with reasoning rationales may be novel (but it is not a particularly significant extension).
* Various analyses relevant to the topic are conducted.
* Existing literature (already cited in the paper, e.g. Zong et al.) has already shown that MLLMs in general do not benefit from support examples for VQA tasks. Consequently, what is presented as surprising in the paper is already known. In a way the paper discusses existing works that support the given conclusion, but at the same time it seems to present the conclusion as novel and surprising.
* The paper focuses only on VQA tasks for studying ICL, but this has been shown (in the cited literatu
Great question, well-motivated. I really like that this paper challenges the default assumption that CoT = reasoning. It brings up a concern that many of us have had but haven’t tested as rigorously.
Nice diagnostic design. The proposed metrics (like in-batch reasoning similarity and cross-embedding distances) are intuitive but powerful. They’re easy to apply and tell a clear story.
Solid new training method. The RvD framework makes a lot of sense: rather than assuming your CoT demos are good,
Limited generalization across modalities. The paper talks about vision-language reasoning in general, but all tests are focused on text-based CoT. There’s no analysis of visual reasoning paths or failures when the image is critical.
No hard negatives in RvD sampling. It feels like RvD just picks “more helpful” demos, but doesn’t actively avoid bad ones. Could the method benefit from adversarial or diverse selection?
CoT length tradeoffs are underexplored. Do longer reasoning chains really help
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Advanced Neural Network Applications
