What is Multimodality?
Letitia Parcalabescu, Nils Trost, Anette Frank

TL;DR
This position paper argues that the definitions of multimodality used in machine learning are outdated. It proposes a task-relative definition that focuses on the representations and information relevant to a given task, intended as a foundation for multimodal research, language grounding, and natural language understanding.
Contribution
It introduces a novel, task-focused definition of multimodality, addressing foundational gaps and guiding future research in multimodal machine learning.
Findings
Highlights limitations of existing definitions
Proposes a task-relative multimodality framework
Aims to improve language grounding and NLU
Abstract
Recent years have shown rapid developments in the field of multimodal machine learning, combining, e.g., vision, text, or speech. In this position paper, we explain how the field uses outdated definitions of multimodality that prove unfit for the machine learning era. We propose a new task-relative definition of (multi)modality in the context of multimodal machine learning that focuses on representations and information that are relevant for a given machine learning task. With our new definition of multimodality, we aim to provide a missing foundation for multimodal research, an important component of language grounding and a crucial milestone towards natural language understanding (NLU).
Videos
What nobody tells you about MULTIMODAL Machine Learning! 🙊 THE definition. (YouTube)
Taxonomy
Topics: Multimodal Machine Learning Applications · Topic Modeling · Natural Language Processing Techniques
