Multimodal Intelligence: Representation Learning, Information Fusion, and Applications
Chao Zhang, Zichao Yang, Xiaodong He, Li Deng

TL;DR
This paper reviews recent advances in multimodal deep learning, focusing on representation learning, signal fusion, and applications involving vision and language, highlighting key methods and future research directions.
Contribution
It provides a comprehensive analysis of multimodal models, emphasizing recent techniques in representation, fusion, and applications, to guide future research in multimodal intelligence.
Findings
Unified embedding techniques enable cross-modality processing.
Special architectures improve multimodal signal integration.
Applications include captioning, image generation, and visual question answering.
Abstract
Deep learning methods have revolutionized speech recognition, image recognition, and natural language processing since 2010. Each of these tasks involves a single modality in its input signals. However, many applications in the artificial intelligence field involve multiple modalities. Therefore, it is of broad interest to study the more difficult and complex problem of modeling and learning across multiple modalities. In this paper, we provide a technical review of available models and learning methods for multimodal intelligence. The main focus of this review is the combination of vision and natural language modalities, which has become an important topic in both the computer vision and natural language processing research communities. This review provides a comprehensive analysis of recent works on multimodal deep learning from three perspectives: learning multimodal…
