Fusion or Confusion? Multimodal Complexity Is Not All You Need
Tillmann Rheude, Roland Eils, Benjamin Wild

TL;DR
This large-scale empirical study finds that more complex multimodal architectures often produce confusion rather than effective fusion of data modalities, and it argues for methodological rigor over architectural novelty.
Contribution
It provides a large-scale empirical comparison (19 reimplemented multimodal methods, nine datasets, up to 23 modalities) showing that complex multimodal architectures do not reliably outperform simpler baselines, and it highlights the need for standardized evaluation practices.
Findings
Complex multimodal models do not reliably outperform unimodal baselines.
Increased multimodal complexity often leads to confusion rather than better performance.
Top-tier publications have methodological shortcomings that need addressing.
Abstract
Multimodal learning has become a prominent research area, with the potential for substantial performance gains by combining information across modalities. At the same time, model development has trended toward increasingly complex deep learning architectures, motivated by the assumption that multimodal-specific methods improve performance. We challenge this assumption through a large-scale empirical study, reimplementing 19 high-impact multimodal methods across nine diverse datasets with up to 23 modalities. Under standardized experimental conditions, including hyperparameter tuning, weight initialization, cross-validation, and statistical testing, increased multimodal complexity often yields confusion rather than effective fusion of data modalities. Accordingly, complex multimodal architectures do not reliably outperform unimodal baselines and a Simple Baseline for Multimodal Learning…
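To make the "cross-validation and statistical testing" step concrete, here is a minimal sketch of how per-fold scores from two models could be compared. The abstract does not say which test the authors use, so this swaps in an exact two-sided sign-flip permutation test on paired fold scores as one plausible choice. The function name `paired_sign_flip_test` and the score values are hypothetical and purely illustrative; they are not results from the paper.

```python
from itertools import product


def paired_sign_flip_test(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired per-fold scores.

    Under the null hypothesis that both models perform equally, the sign of
    each per-fold difference is arbitrary, so we enumerate all sign patterns
    and count how many give a summed difference at least as extreme as the
    one observed. Returns the p-value as a float in (0, 1].
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must have the same length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point round-off.
        if stat >= observed - 1e-12:
            extreme += 1
    return extreme / total


if __name__ == "__main__":
    # Made-up fold scores for illustration only (assumed 5-fold CV).
    multimodal = [0.81, 0.79, 0.83, 0.80, 0.82]
    unimodal = [0.80, 0.80, 0.82, 0.81, 0.81]
    p_value = paired_sign_flip_test(multimodal, unimodal)
    print(f"p-value: {p_value:.4f}")
```

Pairing scores by fold matters here: both models must be evaluated on the same splits, which is what a standardized protocol enforces. Note that with only five folds the smallest achievable p-value is 2/32, so small fold counts limit what any such test can show.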
