Does my multimodal model learn cross-modal interactions? It's harder to tell than you might think!
Jack Hessel, Lillian Lee

TL;DR
This paper introduces EMAP, a diagnostic tool for testing whether cross-modal interactions actually improve a multimodal model's performance on a given task. It finds that many high-performing models rely mostly on unimodal signals rather than on true interactions.
Contribution
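The paper proposes EMAP, a method that isolates the effect of cross-modal interactions by projecting a model's predictions onto their additive, unimodal structure. Comparing the original and projected predictions challenges the assumption that strong multimodal performance implies the model is using interactions.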
Findings
Removing cross-modal interactions often results in little to no performance degradation.
Expressive models with the capacity to consider interactions may outperform others without actually leveraging those interactions.
Many models rely on unimodal signals despite high performance.
Abstract
Modeling expressive cross-modal interactions seems crucial in multimodal tasks, such as visual question answering. However, sometimes high-performing black-box algorithms turn out to be mostly exploiting unimodal signals in the data. We propose a new diagnostic tool, empirical multimodally-additive function projection (EMAP), for isolating whether or not cross-modal interactions improve performance for a given model on a given task. This function projection modifies model predictions so that cross-modal interactions are eliminated, isolating the additive, unimodal structure. For seven image+text classification tasks (on each of which we set new state-of-the-art benchmarks), we find that, in many cases, removing cross-modal interactions results in little to no performance degradation. Surprisingly, this holds even when expressive models, with capacity to consider interactions, otherwise…
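The projection described in the abstract can be illustrated with a minimal sketch. It assumes the model has been evaluated on every text-image combination in the evaluation set, giving an array `logits[i, j]` for text `i` paired with image `j`. The projected score for a matched pair is then the text-only average plus the image-only average minus the overall average. The names `emap` and `logits` and the toy data are hypothetical, and this is one reading of the additive projection, not the authors' released code.

```python
import numpy as np

def emap(logits):
    """Additive projection of model outputs (illustrative sketch).

    logits: array of shape (N, N, C), where logits[i, j] is the model's
    output for text i paired with image j (all N x N combinations).
    Returns an (N, C) array of projected outputs for the matched pairs (i, i).
    """
    logits = np.asarray(logits, dtype=float)
    text_part = logits.mean(axis=1)      # (N, C): average over images, per text
    image_part = logits.mean(axis=0)     # (N, C): average over texts, per image
    overall = logits.mean(axis=(0, 1))   # (C,): average over all combinations
    return text_part + image_part - overall

# Toy check 1: a purely additive model f(t, v) = a[t] + b[v].
rng = np.random.default_rng(0)
a = rng.normal(size=(4, 2))
b = rng.normal(size=(4, 2))
additive = a[:, None, :] + b[None, :, :]          # shape (4, 4, 2)
additive_projected = emap(additive)               # matched-pair outputs survive

# Toy check 2: a pure interaction f(t, v) = t * v with zero-mean inputs.
t = np.array([-1.0, 1.0, -1.0, 1.0])
interaction = (t[:, None] * t[None, :])[..., None]  # shape (4, 4, 1)
interaction_projected = emap(interaction)           # interaction is removed
```

In the first toy case the projection leaves the matched-pair outputs unchanged, because nothing in the model depends on the two inputs jointly. In the second, every matched pair originally scores 1, but the projection maps all of them to 0. A large gap between a model's original accuracy and the accuracy of its projected predictions would suggest the model relies on interactions; a small gap would suggest it does not.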
