Learning Relationships between Text, Audio, and Video via Deep Canonical Correlation for Multimodal Language Analysis

Zhongkai Sun; Prathusha Sarma; William Sethares; Yingyu Liang

arXiv:1911.05544 · cs.LG · December 3, 2019 · 26 cites

Open Access

TL;DR

This paper introduces the Interaction Canonical Correlation Network (ICCN), a deep learning model that learns correlations between text, audio, and video features to improve multimodal language analysis tasks such as sentiment analysis and emotion recognition.

Contribution

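The paper presents ICCN, a model based on deep canonical correlation analysis that learns hidden correlations between text-based audio and text-based video features, each derived from the outer product of text features with audio or video features, for multimodal language analysis.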

Findings

01

ICCN outperforms existing methods on benchmark datasets.

02

ICCN effectively captures correlations between text, audio, and video.

03

Ablation studies confirm the importance of multimodal correlations.

Abstract

Multimodal language analysis often considers relationships between features based on text and those based on acoustical and visual properties. Text features typically outperform non-text features in sentiment analysis or emotion recognition tasks in part because the text features are derived from advanced language models or word embeddings trained on massive data sources while audio and video features are human-engineered and comparatively underdeveloped. Given that the text, audio, and video are describing the same utterance in different ways, we hypothesize that the multimodal sentiment analysis and emotion recognition can be improved by learning (hidden) correlations between features extracted from the outer product of text and audio (we call this text-based audio) and analogous text-based video. This paper proposes a novel model, the Interaction Canonical Correlation Network (ICCN),…
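The two ingredients the abstract names, outer-product features and canonical correlation, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, feature dimensions, synthetic data, and regularization constant are all assumptions, and plain linear CCA stands in for the deep canonical correlation model the paper proposes.

```python
import numpy as np


def outer_product_features(text, other):
    """Per-utterance outer product, flattened.

    text: (n, d_t), other: (n, d_o) -> (n, d_t * d_o)
    """
    return np.einsum("ni,nj->nij", text, other).reshape(text.shape[0], -1)


def linear_cca_correlations(x, y, reg=1e-3):
    """Canonical correlations between two views using linear CCA.

    `reg` is a hypothetical ridge term that keeps the covariance
    matrices positive definite; it is not taken from the paper.
    """
    n = x.shape[0]
    xc = x - x.mean(axis=0)
    yc = y - y.mean(axis=0)
    sxx = xc.T @ xc / (n - 1) + reg * np.eye(x.shape[1])
    syy = yc.T @ yc / (n - 1) + reg * np.eye(y.shape[1])
    sxy = xc.T @ yc / (n - 1)
    # Whiten each view with its Cholesky factor: T = Lx^{-1} Sxy Ly^{-T}.
    lx = np.linalg.cholesky(sxx)
    ly = np.linalg.cholesky(syy)
    t = np.linalg.solve(lx, sxy) @ np.linalg.inv(ly).T
    # The singular values of T are the canonical correlations.
    return np.linalg.svd(t, compute_uv=False)


# Synthetic stand-ins for utterance-level features (assumed sizes).
rng = np.random.default_rng(0)
n = 200
latent = rng.normal(size=(n, 2))  # shared signal behind all modalities
text = latent @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(n, 4))
audio = latent @ rng.normal(size=(2, 3)) + 0.1 * rng.normal(size=(n, 3))
video = latent @ rng.normal(size=(2, 5)) + 0.1 * rng.normal(size=(n, 5))

text_based_audio = outer_product_features(text, audio)  # shape (200, 12)
text_based_video = outer_product_features(text, video)  # shape (200, 20)
correlations = linear_cca_correlations(text_based_audio, text_based_video)
print(correlations.shape)
```

Because the three synthetic modalities share a latent signal, the leading canonical correlations come out high, which mirrors the abstract's premise that text, audio, and video describe the same utterance in different ways.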

Taxonomy

Topics: Sentiment Analysis and Opinion Mining · Topic Modeling · Music and Audio Processing