Multimodal Task Interference: A Benchmark and Analysis of History-Target Mismatch in Multimodal LLMs
Masayuki Kawarada, Tatsuya Ishigaki, Hiroya Takamura

TL;DR
This paper introduces a benchmark to evaluate task interference in multimodal LLMs, revealing that modality mismatch causes significant performance drops, especially when switching from text-only to image-based tasks.
Contribution
It provides the first systematic benchmark for task interference in multimodal LLMs, analyzing the effects of history-target mismatch across multiple dimensions.
Findings
Switching from text-only to image targets causes severe performance drops.
Interference is amplified when multiple mismatches co-occur.
Modality differences drive the strongest interference effects.
Abstract
Task interference, the performance degradation caused by task switches within a single conversation, has been studied exclusively in text-only settings despite the growing prevalence of multimodal dialogue systems. We introduce a benchmark for evaluating this phenomenon in multimodal LLMs, covering six tasks across text and vision with systematic variation of history-target mismatch along three axes: modality mismatch, reasoning mismatch, and answer format mismatch. Experiments on both open-weights and proprietary models reveal that task interference is highly directional: switching from text-only to image-based targets causes severe performance drops, while the reverse transition yields minimal degradation. Interference is further amplified when mismatches co-occur across multiple dimensions, and is driven most strongly by modality differences, followed by answer format, while reasoning…
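The three mismatch axes named in the abstract can be illustrated with a small sketch. This is not the authors' code: the `Task` class, the `mismatch_axes` and `performance_drop` helpers, and the label values ("factual", "free_form", and so on) are hypothetical names chosen for illustration. The sketch only shows how a history-target pair might be tagged by the axes on which it differs, and how a drop relative to a matched-history baseline might be computed.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Task:
    """A task described along the three axes named in the abstract.

    All label values used below are hypothetical placeholders.
    """
    name: str
    modality: str       # e.g. "text" or "image"
    reasoning: str      # hypothetical reasoning-type label
    answer_format: str  # hypothetical answer-format label


AXES = ("modality", "reasoning", "answer_format")


def mismatch_axes(history: Task, target: Task) -> List[str]:
    """Return the axes on which the history task differs from the target task."""
    return [axis for axis in AXES if getattr(history, axis) != getattr(target, axis)]


def performance_drop(matched_score: float, mismatched_score: float) -> float:
    """Absolute drop in target-task score when the history does not match."""
    return matched_score - mismatched_score


# Hypothetical example pair: a text-only history followed by an image target.
text_qa = Task("text_qa", "text", "factual", "free_form")
image_mc = Task("image_mc", "image", "factual", "multiple_choice")

print(mismatch_axes(text_qa, image_mc))  # ['modality', 'answer_format']
```

Tagging each pair with a list of differing axes makes it straightforward to group results by the number of co-occurring mismatches, which is the comparison the abstract describes.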
Taxonomy
Topics: Speech and dialogue systems · Topic Modeling · Multimodal Machine Learning Applications
