Foundations of Multisensory Artificial Intelligence
Paul Pu Liang

TL;DR
This thesis develops theoretical frameworks and practical models for multisensory AI, which learns from multiple modalities at once, with applications in fields such as healthcare, robotics, and multimedia processing.
Contribution
It introduces a formal framework for modality interactions, a large-scale multisensory benchmark, and multimodal architectures that advance the development of general-purpose multisensory AI systems.
Findings
Quantification of modality interactions aids dataset understanding.
MultiBench benchmark enables comprehensive evaluation across modalities.
Multimodal transformers facilitate scalable multisensory AI applications.
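To make the first finding concrete, here is a minimal sketch of one way to quantify how two modalities interact on a toy dataset. It is an illustration under stated assumptions, not the thesis's own measure: the helper `mutual_information`, the XOR toy data, and the `interaction_gap` quantity are all hypothetical choices made for this example. It compares how much each modality tells us about the label on its own with how much the two tell us together.

```python
from collections import Counter
from math import log2


def mutual_information(pairs):
    """Estimate I(A;B) in bits from a list of equally weighted (a, b) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    marginal_a = Counter(a for a, _ in pairs)
    marginal_b = Counter(b for _, b in pairs)
    return sum(
        (count / n) * log2((count / n) / ((marginal_a[a] / n) * (marginal_b[b] / n)))
        for (a, b), count in joint.items()
    )


# Hypothetical toy dataset: the label is the XOR of two binary "modalities".
samples = [(x1, x2, x1 ^ x2) for x1 in (0, 1) for x2 in (0, 1)]

# Information each modality carries about the label on its own.
i_x1 = mutual_information([(x1, y) for x1, _, y in samples])
i_x2 = mutual_information([(x2, y) for _, x2, y in samples])

# Information the two modalities carry about the label jointly.
i_joint = mutual_information([((x1, x2), y) for x1, x2, y in samples])

# Assumed illustrative quantity: joint information beyond the unimodal sum.
interaction_gap = i_joint - (i_x1 + i_x2)

print(i_x1, i_x2, i_joint, interaction_gap)  # 0.0 0.0 1.0 1.0
```

In this toy case neither modality is informative alone, yet together they determine the label, so all of the task-relevant information arises from their interaction. A dataset where one modality already predicts the label would show a very different profile, which is the kind of dataset-level insight the finding refers to.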
Abstract
Building multisensory AI systems that learn from multiple sensory inputs such as text, speech, video, real-world sensors, wearable devices, and medical data holds great promise for impact in many scientific areas with practical benefits, such as in supporting human health and well-being, enabling multimedia content processing, and enhancing real-world autonomous agents. By synthesizing a range of theoretical frameworks and application domains, this thesis aims to advance the machine learning foundations of multisensory AI. In the first part, we present a theoretical framework formalizing how modalities interact with each other to give rise to new information for a task. These interactions are the basic building blocks in all multimodal problems, and their quantification enables users to understand their multimodal datasets, design principled approaches to learn these interactions, and…
Taxonomy
Topics: Advanced Computational Techniques in Science and Engineering
