Enhancing Multimodal In-Context Learning via Inductive-Deductive Reasoning
Haoyu Wang, Haonan Wang, Yuyan Chen, Jun Chen, Gang Liu, Qian Wang, Jiahong Yan, Yanghua Xiao

TL;DR
This paper proposes a new framework for multimodal in-context learning that enhances reasoning and rule extraction in vision-language models through inductive-deductive restructuring, visual token filtering, and attention balancing.
Contribution
It introduces a structured inductive-deductive approach with visual token compression, attention rebalancing, and chain-of-thought prompting, improving multimodal ICL performance.
Findings
Significant performance improvements across eight benchmarks.
Enhanced reasoning and rule extraction capabilities.
Better handling of visual redundancy and attention skew.
Abstract
In-context learning (ICL) allows large models to adapt to tasks using a few examples, yet its extension to vision-language models (VLMs) remains fragile. Our analysis reveals that the fundamental limitation lies in an inductive gap: models often produce correct answers from flawed reasoning while struggling to extract consistent rules across demonstrations. This gap is further exacerbated by two visual-level obstacles: an overwhelming proportion of redundant visual tokens that obscure textual cues, and a skewed attention distribution that favors the initial image at the expense of subsequent context. To address these issues, we introduce a framework that restructures multimodal ICL as a principled inductive-deductive process. The framework incorporates a similarity-based visual token compression module to filter out redundant patches, a dynamic attention rebalancing mechanism to…
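The abstract names similarity-based visual token compression but does not specify the algorithm here. As a rough illustration of the general idea only, the sketch below greedily drops any patch embedding whose cosine similarity to an already-kept patch reaches a threshold. The function names, the greedy strategy, and the 0.9 threshold are all assumptions made for this example, not the paper's method.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def filter_redundant_tokens(tokens, threshold=0.9):
    """Return indices of tokens to keep (hypothetical helper, not from the paper).

    A token is kept only if its similarity to every previously kept token
    is strictly below `threshold`; otherwise it is treated as redundant.
    """
    kept = []
    for i, tok in enumerate(tokens):
        if all(cosine(tok, tokens[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy patch embeddings: the 2nd nearly duplicates the 1st, the 4th is a
# scaled copy of the 3rd, so only two distinct directions survive.
patches = [[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.0, 2.0]]
kept_indices = filter_redundant_tokens(patches)
print(kept_indices)  # [0, 2]
```

A lower threshold compresses more aggressively. A real module would operate on the vision encoder's patch embeddings and would likely use a batched matrix product rather than this quadratic pure-Python loop.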
