Lightweight In-Context Tuning for Multimodal Unified Models
Yixin Chen, Shuai Zhang, Boran Han, Jiaya Jia

TL;DR
This paper introduces M$^2$IXT, a lightweight module that significantly improves in-context learning for multimodal models across various tasks with minimal data and model size.
Contribution
The paper proposes M$^2$IXT, a novel, adaptable module that enhances few-shot in-context learning in multimodal models, achieving state-of-the-art results with fewer parameters.
Findings
Boosts few-shot ICL performance by up to 18%.
Achieves state-of-the-art results across multiple multimodal tasks.
Model is approximately 20 times smaller than comparable methods.
Abstract
In-context learning (ICL) involves reasoning from given contextual examples. As more modalities are added, this procedure becomes more challenging because the interleaved input modalities convolute the understanding process. This is exemplified by the observation that multimodal models often struggle to effectively extrapolate from contextual examples to perform ICL. To address these challenges, we introduce MultiModal In-conteXt Tuning (M$^2$IXT), a lightweight module to enhance the ICL capabilities of multimodal unified models. The proposed M$^2$IXT module perceives an expandable context window to incorporate various labeled examples of multiple modalities (e.g., text, image, and coordinates). It can be prepended to various multimodal unified models (e.g., OFA, Unival, LLaVA) of different architectures and trained via a mixed-tasks strategy to enable rapid few-shot adaptation on multiple…
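As a rough illustration of the "expandable context window" of labeled multimodal examples described in the abstract, the sketch below lays out a few-shot sequence in which labeled examples (image, instruction, label) are placed ahead of an unlabeled query. It is not the authors' implementation: the `Example` structure, the `build_context` helper, the modality tags, and the string image identifiers are all hypothetical stand-ins, and a real module would presumably operate on embeddings rather than raw segments.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple, Union

# Hypothetical types: a label is either text or a bounding box (x1, y1, x2, y2),
# and a segment is a (modality, payload) pair.
Payload = Union[str, Tuple[int, int, int, int]]
Segment = Tuple[str, Payload]


@dataclass
class Example:
    """One labeled in-context example (hypothetical structure)."""
    image_id: str      # stand-in for image features
    instruction: str   # task prompt, e.g. a question
    label: Payload     # text answer or box coordinates


def example_segments(ex: Example) -> List[Segment]:
    """Flatten one labeled example into modality-tagged segments."""
    label_modality = "text" if isinstance(ex.label, str) else "coords"
    return [
        ("image", ex.image_id),
        ("text", ex.instruction),
        (label_modality, ex.label),
    ]


def build_context(examples: Sequence[Example], query_image: str,
                  query_instruction: str, num_shots: int) -> List[Segment]:
    """Place up to num_shots labeled examples ahead of an unlabeled query."""
    segments: List[Segment] = []
    for ex in examples[:num_shots]:
        segments.extend(example_segments(ex))
    # The query carries no label; the model is expected to produce it.
    segments.append(("image", query_image))
    segments.append(("text", query_instruction))
    return segments


shots = [
    Example("img_001", "what color is the car?", "red"),
    Example("img_002", "which region does the text describe? a dog",
            (12, 30, 200, 180)),
]
context = build_context(shots, "img_query", "what color is the bus?", num_shots=2)
for modality, payload in context:
    print(modality, payload)
```

Changing `num_shots` grows or shrinks the context without altering the query, which is one simple way to picture a context window that expands with the number of examples.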
Peer Reviews
Decision: ICLR 2024 Conference Withdrawn Submission
1. M2IXT is a general plug-in module that can be applied to any multimodal model. 2. Experiments validate the effectiveness of M2IXT in improving the few-shot learning capabilities of existing multimodal unified models.
Let me first roughly define two types/eras of (multimodal) models. 1) Finetuning-style pretrained models (old-era), like BERT, BEiT-3, OFA. They show great finetuning benchmark performance. Some of them rely more on academically curated labeled datasets for training, which could contribute to their benchmark performance. But they are not scalable. For these models, few-shot ability is typically not the focus. 2) Large models (new-era), like GPT-3, Flamingo. For these models, we mainly focus on th
+ I agree that the capability of in-context learning (ICL) is very important for multimodal foundation models because the ultimate goal of such approaches is making AGI, and ICL makes it easier to infer on unseen tasks and data, as shown in Section 4.3. From this point of view, I think the effectiveness of the proposed method endowing multimodal unified models with the ICL ability is quite high. + With little training data and a lightweight additional module, the proposed method achieves per
I cannot give this paper a high rating for the following reasons. - The writing and presentation should be improved. It was quite hard to follow in several places when I first read the paper. Especially, the authors should revise the approach section (Section 3) to describe the architecture of M2IXT and the training procedure in more detail. Figure 2 does not contain the details of the module. - I think that one of the biggest advantages of ICL is enabling few-shot adaptation to unseen tasks. How
1. The paper is well-written and nicely structured. 2. The proposed module can be prepended to various multimodal unified models of different architectures and is easy to train. 3. Experiments show M2IXT can significantly boost the few-shot ICL performance.
Please see the question part below
Contribution: According to the authors' statement, the main contribution of this paper is to enhance the contextual capabilities of existing multi-modal models by continuing to train the model with in-context examples. This method is not novel, because a similar method has been explored in the previous work “MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning”. In addition, in the NLP area, there are also many works that introduce in-context examples to improve the ove
Contributions and Methods: Multimodal In-Context Tuning: The paper builds upon the concept of multimodal in-context tuning as previously explored in "MMICL." In comparison to prior works, the authors introduce additional trainable parameters within multimodal language models, which include embedding tables and a target embedding table. However, the rationale behind incorporating these specific modules is not adequately explained. Particularly, the method section lacks clarity in describing the
Taxonomy
Topics: Multimodal Machine Learning Applications · Domain Adaptation and Few-Shot Learning · Topic Modeling
Methods: OFA
