TL;DR
This paper introduces a multimodal deep learning model that combines visual and textual data to infer users' latent emotional states from social media posts, outperforming unimodal models and aligning with psychological theories.
Contribution
The paper presents a novel multimodal neural network for emotion inference that integrates image and text analysis, providing interpretable results and psychological insights.
Findings
The multimodal model outperforms image-only and text-only models at predicting emotion tags.
The model yields interpretable word lists associated with each emotion.
The structure of emotions implied by the model aligns with structures posited in the psychology literature.
Abstract
We propose a novel approach to multimodal sentiment analysis using deep neural networks combining visual analysis and natural language processing. Our goal differs from the standard sentiment analysis goal of predicting whether a sentence expresses positive or negative sentiment; instead, we aim to infer the latent emotional state of the user. Thus, we focus on predicting the emotion word tags attached by users to their Tumblr posts, treating these as "self-reported emotions." We demonstrate that our multimodal model combining both text and image features outperforms separate models based solely on either images or text. Our model's results are interpretable, automatically yielding sensible word lists associated with emotions. We explore the structure of emotions implied by our model and compare it to what has been posited in the psychology literature, and validate our model on a…
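As a rough illustration of what "combining both text and image features" can look like, the sketch below concatenates a per-post image feature vector with a text feature vector and passes the result through a linear softmax layer over emotion tags. This is a generic late-fusion sketch, not the paper's architecture, which this summary does not describe. The function names, the feature sizes, the random stand-in features, and the three emotion tags are all hypothetical.

```python
import math
import random

# Hypothetical emotion tags, used only for illustration.
EMOTIONS = ["happy", "sad", "angry"]

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_and_predict(image_feats, text_feats, weights, bias):
    """Late fusion: concatenate both feature vectors, then apply a linear softmax layer.

    weights holds one row per emotion; each row has len(image_feats) + len(text_feats) entries.
    """
    fused = image_feats + text_feats  # list concatenation
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return softmax(scores)

# Random stand-ins for features a real system would get from an image encoder
# and a text encoder (assumed here, not taken from the paper).
random.seed(0)
D_IMG, D_TXT = 4, 3
image_feats = [random.gauss(0, 1) for _ in range(D_IMG)]
text_feats = [random.gauss(0, 1) for _ in range(D_TXT)]
weights = [[random.gauss(0, 0.1) for _ in range(D_IMG + D_TXT)] for _ in EMOTIONS]
bias = [0.0 for _ in EMOTIONS]

probs = fuse_and_predict(image_feats, text_feats, weights, bias)
predicted = EMOTIONS[probs.index(max(probs))]
```

An image-only or text-only baseline corresponds to passing just one of the two feature vectors (with a correspondingly narrower weight matrix), which is the kind of comparison the abstract reports.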
