TL;DR
This paper introduces the Tensor Fusion Network, a deep learning model that learns intra- and inter-modality dynamics end-to-end for multimodal sentiment analysis and, in the paper's experiments, outperforms state-of-the-art approaches.
Contribution
The paper frames multimodal sentiment analysis as modeling intra-modality and inter-modality dynamics, and presents the Tensor Fusion Network, which learns both end-to-end. The model is tailored to spoken language in online videos and the gestures and voice that accompany it.
Findings
Outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis
Learns intra-modality and inter-modality dynamics end-to-end
Tailored to the volatile nature of spoken language in online videos, along with accompanying gestures and voice
Abstract
Multimodal sentiment analysis is an increasingly popular research area, which extends the conventional language-based definition of sentiment analysis to a multimodal setup where other relevant modalities accompany language. In this paper, we pose the problem of multimodal sentiment analysis as modeling intra-modality and inter-modality dynamics. We introduce a novel model, termed Tensor Fusion Network, which learns both such dynamics end-to-end. The proposed approach is tailored for the volatile nature of spoken language in online videos as well as accompanying gestures and voice. In the experiments, our model outperforms state-of-the-art approaches for both multimodal and unimodal sentiment analysis.
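The abstract does not spell out how inter-modality dynamics are represented. A minimal sketch follows, assuming the fusion step is a three-way outer product of the language, visual, and acoustic embeddings, each extended with a constant 1 so that the unimodal and bimodal terms survive alongside the trimodal ones. The function name tensor_fusion, the embedding sizes, and the sample values are hypothetical and chosen only for illustration; the upstream networks that would produce the embeddings are omitted.

```python
import numpy as np

def tensor_fusion(z_language, z_visual, z_acoustic):
    """Illustrative fusion: outer product of three embeddings, each padded with 1.

    Assumption: this mirrors the general idea of tensor-based fusion; it is
    not taken from the abstract and is not the authors' code.
    """
    zl = np.append(z_language, 1.0)   # shape (d_l + 1,)
    zv = np.append(z_visual, 1.0)     # shape (d_v + 1,)
    za = np.append(z_acoustic, 1.0)   # shape (d_a + 1,)
    # Entry [i, j, k] is zl[i] * zv[j] * za[k].
    return np.einsum("i,j,k->ijk", zl, zv, za)

# Hypothetical embeddings with tiny dimensions for readability.
z_l = np.array([0.5, -1.0])
z_v = np.array([2.0])
z_a = np.array([3.0, 4.0])

fused = tensor_fusion(z_l, z_v, z_a)
print(fused.shape)
```

Because of the appended 1s, slices of the fused tensor recover lower-order terms: fused[:2, -1, -1] equals the language embedding on its own, and fused[:2, :1, -1] equals the language-visual outer product. In a full model the tensor would typically be flattened and passed to a small classifier or regressor.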
