Multi-modal Deep Analysis for Multimedia
Wenwu Zhu, Xin Wang, Hongzhi Li

TL;DR
This paper provides a comprehensive overview of multi-modal deep analysis in multimedia, focusing on data-driven and knowledge-guided fusion techniques for heterogeneous data types like text, images, videos, and audio.
Contribution
It frames the field around two scientific problems, data-driven correlational representation and knowledge-guided fusion, and surveys methods for multi-modal data fusion and knowledge integration, highlighting recent advances and future directions.
Findings
Survey of multi-modal deep representation methods
Discussion of knowledge-guided fusion approaches
Analysis of applications like visual question answering and video summarization
Abstract
With the rapid development of the Internet and multimedia services over the past decade, a huge amount of user-generated and service-provider-generated multimedia data has become available. These data are heterogeneous and multi-modal in nature, which makes them challenging to process and analyze. Multi-modal data consist of a mixture of data types from different modalities, such as text, images, video, and audio. In this article, we present a deep and comprehensive overview of multi-modal analysis in multimedia. We introduce two scientific research problems: data-driven correlational representation and knowledge-guided fusion for multimedia analysis. We investigate these two problems from the following aspects: 1) multi-modal correlational representation: multi-modal fusion of data across different modalities, and 2) multi-modal data and knowledge…
