Multimodal LLMs Do Not Compose Skills Optimally Across Modalities
Paula Ontalvilla, Aitor Ormazabal, Gorka Azkune

TL;DR
This paper investigates the ability of Multimodal Large Language Models to compose skills across different modalities, revealing significant gaps and exploring methods to improve their compositional capabilities.
Contribution
It introduces evaluation tasks for cross-modal skill composition and assesses existing MLLMs, proposing prompting and fine-tuning strategies to enhance their compositional performance.
Findings
All models show significant cross-modality skill composition gaps.
Chain-of-thought prompting improves skill composition performance.
Fine-tuning strategies also help but do not fully close the gap.
Abstract
Skill composition is the ability to combine previously learned skills to solve new tasks. As neural networks acquire increasingly complex skills during their pretraining, it is not clear how successfully they can compose them. In this paper, we focus on Multimodal Large Language Models (MLLMs), and study their ability to compose skills across modalities. To this end, we design three evaluation tasks which can be solved by sequentially composing two modality-dependent skills, and evaluate several open MLLMs under two main settings: i) prompting the model to directly solve the task, and ii) using a two-step cascaded inference approach, which manually enforces the composition of the two skills for a given task. Even with these straightforward compositions, we find that all evaluated MLLMs exhibit a significant cross-modality skill composition gap. To mitigate the aforementioned gap, we explore…
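The two evaluation settings described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the `Model` callable interface, the prompt strings, exact-match scoring, and the definition of the gap as cascaded accuracy minus direct accuracy are all assumptions made for illustration.

```python
from typing import Callable, Optional, Sequence

# Hypothetical model interface (an assumption, not a real library API):
# takes a text prompt plus an optional non-text input (image, audio, ...)
# and returns the generated text.
Model = Callable[[str, Optional[object]], str]


def direct_inference(model: Model, task_prompt: str, media: object) -> str:
    """Setting (i): ask the model to solve the whole task in a single call."""
    return model(task_prompt, media)


def cascaded_inference(
    model: Model, skill1_prompt: str, skill2_template: str, media: object
) -> str:
    """Setting (ii): enforce the composition manually with two calls.

    Step 1 applies the modality-dependent skill to the non-text input.
    Step 2 applies the second skill to the text produced by step 1.
    """
    intermediate = model(skill1_prompt, media)
    step2_prompt = skill2_template.format(intermediate=intermediate)
    return model(step2_prompt, None)


def accuracy(predictions: Sequence[str], golds: Sequence[str]) -> float:
    """Exact-match accuracy after trimming and lowercasing (assumed metric)."""
    if not golds:
        raise ValueError("golds must not be empty")
    if len(predictions) != len(golds):
        raise ValueError("predictions and golds must have the same length")
    hits = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, golds)
    )
    return hits / len(golds)


def composition_gap(
    direct_preds: Sequence[str],
    cascaded_preds: Sequence[str],
    golds: Sequence[str],
) -> float:
    """Assumed gap definition: cascaded accuracy minus direct accuracy."""
    return accuracy(cascaded_preds, golds) - accuracy(direct_preds, golds)
```

Under these assumptions, a positive gap means the model performs each skill well in isolation but fails to chain them when prompted directly.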
Taxonomy
Topics: Topic Modeling · Multimodal Machine Learning Applications · Text Readability and Simplification
