Text Takes Over: A Study of Modality Bias in Multimodal Intent Detection
Ankan Mullick, Saransh Sharma, Abhik Jana, Pawan Goyal

TL;DR
This study shows that text-only models outperform multimodal ones on multimodal intent detection because the benchmark datasets are heavily text-biased, highlighting the need for unbiased datasets.
Contribution
The paper identifies modality bias in multimodal intent datasets, proposes a debiasing framework, and analyzes the impact of bias on model performance and modality relevance.
Findings
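Mistral-7B, a text-only LLM, outperforms most competitive multimodal models by approximately 9% on MIntRec-1 and 4% on MIntRec2.0.
Over 90% of samples in these datasets require textual input, alone or combined with other modalities, for correct classification.
Debiasing shrinks the datasets and significantly lowers model accuracy.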
Abstract
The rise of multimodal data, integrating text, audio, and visuals, has created new opportunities for studying multimodal tasks such as intent detection. This work investigates the effectiveness of Large Language Models (LLMs) and non-LLMs, including text-only and multimodal models, in the multimodal intent detection task. Our study reveals that Mistral-7B, a text-only LLM, outperforms most competitive multimodal models by approximately 9% on MIntRec-1 and 4% on MIntRec2.0 datasets. This performance advantage comes from a strong textual bias in these datasets, where over 90% of the samples require textual input, either alone or in combination with other modalities, for correct classification. We also confirm the modality bias of these datasets via human evaluation. Next, we propose a framework to debias the datasets, and upon debiasing, more than 70% of the samples in MIntRec-1 and more…
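The notion of a sample that "requires textual input" can be made concrete with a small sketch. This is an illustration under stated assumptions, not the paper's framework: it assumes that for each sample we already know whether a classifier restricted to a given subset of modalities predicted the intent correctly. The function names (`needs_text`, `text_dependent_share`, `drop_text_dependent`) and the data layout are hypothetical.

```python
from typing import Dict, FrozenSet, List

# Hypothetical layout: for one sample, map each modality subset that was
# evaluated to whether a model using only that subset classified it correctly.
SampleResult = Dict[FrozenSet[str], bool]


def needs_text(result: SampleResult) -> bool:
    """True if the sample is classified correctly only when text is present.

    Assumption: a sample counts as text-dependent when at least one modality
    subset succeeds and every succeeding subset includes "text".
    """
    correct = [mods for mods, ok in result.items() if ok]
    return bool(correct) and all("text" in mods for mods in correct)


def text_dependent_share(results: List[SampleResult]) -> float:
    """Fraction of samples that are text-dependent under the rule above."""
    if not results:
        return 0.0
    return sum(needs_text(r) for r in results) / len(results)


def drop_text_dependent(results: List[SampleResult]) -> List[SampleResult]:
    """One possible filter (an assumption): keep samples that do not need text."""
    return [r for r in results if not needs_text(r)]
```

Under this rule, a sample that no modality subset classifies correctly is not counted as text-dependent. Whether the paper treats such samples the same way is not stated in the abstract.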
