Modality Collapse as Mismatched Decoding: Information-Theoretic Limits of Multimodal LLMs
Jayadev Billa

TL;DR
This paper analyzes the limitations of multimodal large language models through an information-theoretic lens, revealing how mismatched decoding and training objectives constrain their ability to process non-text modalities effectively.
Contribution
It introduces an information-theoretic framework to understand modality collapse in multimodal LLMs and demonstrates how decoder scoring rules and training objectives influence information accessibility.
Findings
Decoder loss decreases when up to 98% of the variation in modality-specific (non-text) directions is removed.
Information loss increases with distributional mismatch and decoder sensitivity.
Training with emotion-related objectives significantly improves emotion detection accuracy.
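The first finding involves removing variation along a chosen direction in representation space. The sketch below shows one way such an operation could be written; it is an illustration under stated assumptions, not the paper's procedure. The function names, the toy vectors, and the choice to shrink deviations about the mean are all hypothetical.

```python
import math

def variance_along(vectors, direction):
    """Population variance of the vectors' coordinates along `direction`."""
    norm = math.sqrt(sum(d * d for d in direction))
    coords = [sum(a * d / norm for a, d in zip(v, direction)) for v in vectors]
    mean = sum(coords) / len(coords)
    return sum((c - mean) ** 2 for c in coords) / len(coords)

def shrink_variance_along(vectors, direction, keep_fraction=0.02):
    """Keep only `keep_fraction` of the variance along `direction`.

    Components orthogonal to `direction` are left untouched.
    keep_fraction=0.02 corresponds to removing 98% of the variation
    (hypothetical helper; the paper's exact ablation may differ).
    """
    norm = math.sqrt(sum(d * d for d in direction))
    u = [d / norm for d in direction]  # unit vector
    coords = [sum(a * b for a, b in zip(v, u)) for v in vectors]
    mean = sum(coords) / len(coords)
    scale = math.sqrt(keep_fraction)  # variance scales with the square
    out = []
    for v, c in zip(vectors, coords):
        new_c = mean + scale * (c - mean)
        out.append([a + (new_c - c) * b for a, b in zip(v, u)])
    return out

# Toy data: axis 0 stands in for a "text-aligned" direction,
# axis 1 for a "modality-specific" one (both labels are assumptions).
vectors = [[1.0, 2.0], [3.0, -1.0], [-2.0, 0.5], [0.0, 4.0]]
modality_direction = [0.0, 1.0]
shrunk = shrink_variance_along(vectors, modality_direction, keep_fraction=0.02)

before = variance_along(vectors, modality_direction)
after = variance_along(shrunk, modality_direction)
print(f"variance kept along the direction: {after / before:.4f}")
```

In this toy setup the coordinate along axis 0 is unchanged, so any downstream loss that depends only on axis 0 would be unaffected by the ablation.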
Abstract
Numerous studies have shown that multimodal LLMs process speech and images well but fail in non-intuitive ways, rendering trivial tasks such as object counting unreliable. We investigate this behavior from an information-theoretic perspective by framing multimodal LLM inference as a mismatched decoder problem: a decoder trained primarily on text can only extract information along text-aligned directions (removing up to 98% of the variation in modality-specific, non-text directions improves decoder loss), and the amount of accessible information is bounded by the Generalized Mutual Information (GMI). We show that information loss is bounded by a quantity that grows with the distributional mismatch between the source data and the text data and with the sensitivity of the decoder. This bound is a function of the model's scoring rule, not its architecture. We validate the predictions across five…
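To make the mismatched-decoder framing concrete, here is a minimal sketch that computes the GMI for a toy discrete channel. It assumes the standard definition from the mismatched-decoding literature, GMI = sup over s > 0 of E[log(q(X,Y)^s / E_X'[q(X',Y)^s])], which may differ in detail from the form used in the paper. The four-symbol channel, the "text-only" metric, and every name in the code are hypothetical.

```python
import math

def gmi(p_x, channel, metric, s_grid=None):
    """Generalized mutual information in nats, maximized over a grid of s.

    p_x[x]        : input distribution
    channel[x][y] : true channel W(y|x)
    metric[x][y]  : decoder scoring rule q(x, y); must be strictly positive
    """
    if s_grid is None:
        s_grid = [0.05 * k for k in range(1, 201)]  # s in (0, 10]
    n = len(p_x)
    best = 0.0  # the objective tends to 0 as s -> 0
    for s in s_grid:
        total = 0.0
        for x in range(n):
            for y, w in enumerate(channel[x]):
                if p_x[x] * w == 0.0:
                    continue
                denom = sum(p_x[xb] * metric[xb][y] ** s for xb in range(n))
                total += p_x[x] * w * math.log(metric[x][y] ** s / denom)
        best = max(best, total)
    return best

# Toy source: symbol x encodes two bits. The high bit (x >> 1) plays the
# role of "text-aligned" content, the low bit (x & 1) the role of
# "modality-specific" content. Both labels are assumptions for illustration.
p_x = [0.25] * 4
channel = [[0.85 if y == x else 0.05 for y in range(4)] for x in range(4)]

# A scoring rule that only checks the high bit, so it is blind to the low bit.
text_only_metric = [[0.9 if (x >> 1) == (y >> 1) else 0.1 for y in range(4)]
                    for x in range(4)]

gmi_matched = gmi(p_x, channel, channel)              # decoder uses true channel
gmi_mismatched = gmi(p_x, channel, text_only_metric)  # decoder ignores low bit
print(f"matched: {gmi_matched:.4f} nats, mismatched: {gmi_mismatched:.4f} nats")
```

With the matched metric the GMI coincides with the ordinary mutual information of the channel. With the text-only metric the channel is unchanged, yet the accessible information drops to what the high bit alone carries, which illustrates how a limit can come from the scoring rule rather than from the signal.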
Taxonomy
Topics: Emotion and Mood Recognition · Speech Recognition and Synthesis · Multimodal Machine Learning Applications
