TL;DR
HIFICL models in-context learning in multimodal models more faithfully than existing shift-vector approximations. It uses learnable virtual key-value pairs and a low-rank factorization, and it improves performance on multimodal benchmarks.
Contribution
It proposes HIFICL, an approach that captures the ICL mechanism in multimodal models more faithfully by combining a learnable context (virtual key-value pairs), a low-rank factorization for stable training, and a simple end-to-end training objective.
Findings
HIFICL outperforms existing approximation methods on multimodal benchmarks.
The method effectively models the influence of demonstrations in ICL.
Code is publicly available at the provided GitHub link.
Abstract
In-Context Learning (ICL) is a significant paradigm for Large Multimodal Models (LMMs), using a few in-context demonstrations (ICDs) for new task adaptation. However, its performance is sensitive to demonstration configurations, and it is computationally expensive. Mathematically, the influence of these demonstrations can be decomposed into a dynamic mixture of the standard attention output and the context values. Current approximation methods simplify this process by learning a "shift vector". Inspired by the exact decomposition, we introduce High-Fidelity In-Context Learning (HIFICL) to more faithfully model the ICL mechanism. HIFICL consists of three key components: 1) a set of "virtual key-value pairs" to act as a learnable context, 2) a low-rank factorization for stable and regularized training, and 3) a simple end-to-end training objective. From another perspective, this mechanism…
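The decomposition mentioned in the abstract can be checked numerically. The sketch below is not the authors' code; it only illustrates the general softmax-attention identity that attending over the concatenation of context and query keys equals a mixture, with weight `mu`, of attention over the context alone and attention over the query alone. All names (`attention`, `mixed_attention`, `ctx_k`, `qry_k`, `mu`) and the toy numbers are assumptions made for illustration.

```python
import math

def attention(q, keys, values):
    """Single-query softmax attention.

    Returns the output vector and the log of the softmax normalizer.
    """
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    dim = len(values[0])
    out = [sum(e * v[j] for e, v in zip(exps, values)) / total for j in range(dim)]
    return out, top + math.log(total)

def mixed_attention(q, ctx_k, ctx_v, qry_k, qry_v):
    """Mix context-only and query-only attention with a dynamic weight mu."""
    out_c, log_zc = attention(q, ctx_k, ctx_v)
    out_x, log_zx = attention(q, qry_k, qry_v)
    # mu = Z_c / (Z_c + Z_x), written in terms of the log-normalizers
    mu = 1.0 / (1.0 + math.exp(log_zx - log_zc))
    return [mu * c + (1.0 - mu) * x for c, x in zip(out_c, out_x)], mu

# Toy data (hypothetical): two context pairs and three query pairs.
q = [0.5, -1.0]
ctx_k = [[1.0, 0.0], [0.0, 1.0]]
ctx_v = [[1.0, 2.0], [3.0, 4.0]]
qry_k = [[0.5, 0.5], [-1.0, 1.0], [2.0, -0.5]]
qry_v = [[0.0, 1.0], [1.0, 0.0], [-1.0, -1.0]]

# Attention over the concatenated keys and values ...
full, _ = attention(q, ctx_k + qry_k, ctx_v + qry_v)
# ... matches the mixture of the two separate attention outputs.
mixed, mu = mixed_attention(q, ctx_k, ctx_v, qry_k, qry_v)

print("full :", full)
print("mixed:", mixed)
print("mu   :", mu)
```

Because `mu` depends on the query through both normalizers, the contribution of the context changes from token to token. In this sketch, `ctx_k` and `ctx_v` are fixed toy values; a learnable context would replace them with trainable parameters.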
