Multi-modal Deep Analysis for Multimedia

Wenwu Zhu; Xin Wang; Hongzhi Li

arXiv:1910.04964 · cs.MM · January 7, 2020

TL;DR

This paper provides a comprehensive overview of multi-modal deep analysis in multimedia, focusing on data-driven and knowledge-guided fusion techniques for heterogeneous data types like text, images, videos, and audio.

Contribution

It introduces two key scientific problems and explores methods for multi-modal data fusion and knowledge integration, highlighting recent advances and future directions.

Findings

1. Survey of multi-modal deep representation methods
2. Discussion of knowledge-guided fusion approaches
3. Analysis of applications such as visual question answering and video summarization

Abstract

With the rapid development of Internet and multimedia services over the past decade, a huge amount of user-generated and service-provider-generated multimedia data has become available. These data are heterogeneous and multi-modal in nature, which poses great challenges for processing and analysis. Multi-modal data mix several types of data from different modalities, such as text, images, video, and audio. In this article, we present a deep and comprehensive overview of multi-modal analysis in multimedia. We introduce two scientific research problems: data-driven correlational representation and knowledge-guided fusion for multimedia analysis. We investigate these two problems from the following aspects: 1) multi-modal correlational representation: multi-modal fusion of data across different modalities, and 2) multi-modal data and knowledge…
