When One Modality Sabotages the Others: A Diagnostic Lens on Multimodal Reasoning

Chenyu Zhang, Minsol Kim, Shohreh Ghorbani, Jingyao Wu, Rosalind Picard, Patricia Maes, and Paul Pu Liang

arXiv:2511.02794 · cs.AI · November 5, 2025

Open Access

TL;DR

This paper introduces modality sabotage, a failure mode in which a high-confidence unimodal error overrides other evidence, together with a diagnostic framework that analyzes how each modality influences a multimodal model's prediction, revealing failure modes and guiding improvements.

Contribution

It proposes a model-agnostic evaluation layer that audits modality contributions and detects sabotage, advancing understanding of multimodal reasoning failures.

Findings

01. Identified systematic reliability profiles in emotion recognition benchmarks.

02. Revealed cases where unimodal errors override multimodal evidence.

03. Provided insights into dataset artifacts versus model limitations.

Abstract

Despite rapid growth in multimodal large language models (MLLMs), their reasoning traces remain opaque: it is often unclear which modality drives a prediction, how conflicts are resolved, or when one stream dominates. In this paper, we introduce modality sabotage, a diagnostic failure mode in which a high-confidence unimodal error overrides other evidence and misleads the fused result. To analyze such dynamics, we propose a lightweight, model-agnostic evaluation layer that treats each modality as an agent, producing candidate labels and a brief self-assessment used for auditing. A simple fusion mechanism aggregates these outputs, exposing contributors (modalities supporting correct outcomes) and saboteurs (modalities that mislead). Applying our diagnostic layer in a case study on multimodal emotion recognition benchmarks with foundation models revealed systematic reliability profiles,…
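The auditing idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the fusion rule, the format of the self-assessment, or the exact criteria for labelling a modality a contributor or a saboteur, so everything below is an assumption made for illustration: a confidence-weighted vote stands in for the "simple fusion mechanism", a single self-reported confidence in [0, 1] stands in for the self-assessment, and the names `ModalityReport`, `fuse`, and `audit` are hypothetical rather than taken from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class ModalityReport:
    """One modality agent's output (hypothetical structure)."""
    modality: str      # e.g. "text", "audio", "video"
    label: str         # candidate label proposed by this modality
    confidence: float  # self-assessed confidence in [0, 1] (assumed format)


def fuse(reports):
    """Assumed fusion rule: confidence-weighted vote over candidate labels."""
    scores = defaultdict(float)
    for r in reports:
        scores[r.label] += r.confidence
    # The label with the highest summed confidence wins.
    return max(scores, key=scores.get)


def audit(reports, gold):
    """Compare the fused label with the gold label and attribute roles.

    Contributors: modalities whose own label matches the gold label.
    Saboteurs: when the fused label is wrong, the modalities that voted
    for that wrong label (an illustrative definition, not the paper's).
    """
    fused = fuse(reports)
    contributors = [r.modality for r in reports if r.label == gold]
    saboteurs = []
    if fused != gold:
        saboteurs = [r.modality for r in reports if r.label == fused]
    return {
        "fused": fused,
        "correct": fused == gold,
        "contributors": contributors,
        "saboteurs": saboteurs,
    }


# Made-up example: a confident but wrong text agent outvotes two
# less confident, correct agents.
reports = [
    ModalityReport("text", "happy", 0.95),
    ModalityReport("audio", "sad", 0.50),
    ModalityReport("video", "sad", 0.40),
]
result = audit(reports, gold="sad")
print(result)
```

In this made-up example the text agent's single high-confidence vote (0.95) outweighs the combined audio and video votes (0.90), so the fused label is wrong and text is flagged as the saboteur. That is the pattern the abstract describes as a high-confidence unimodal error overriding other evidence.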

Taxonomy

Topics: Speech and dialogue systems · Emotion and Mood Recognition · Multimodal Machine Learning Applications