TL;DR
This paper presents a novel two-stream model that translates images into auxiliary text to enhance multimodal sentiment classification on Twitter, achieving state-of-the-art results without modifying language models.
Contribution
Introduces an input-space translation approach that uses an object-aware transformer to turn images into text, so that visual information can be supplied to a language model for multimodal sentiment analysis.
Findings
Achieves state-of-the-art performance on Twitter datasets.
Effectively distills object-level image information into auxiliary text.
Demonstrates the benefit of translating images into the input space rather than modifying the language model internally for multimodal fusion.
Abstract
Multimodal target/aspect sentiment classification combines multimodal sentiment analysis and aspect/target sentiment classification. The goal of the task is to combine vision and language to understand the sentiment towards a target entity in a sentence. Twitter is an ideal setting for the task because it is inherently multimodal, highly emotional, and affects real-world events. However, multimodal tweets are short and accompanied by complex, possibly irrelevant images. We introduce a two-stream model that translates images in input space using an object-aware transformer followed by a single-pass non-autoregressive text generation approach. We then leverage the translation to construct an auxiliary sentence that provides multimodal information to a language model. Our approach increases the amount of text available to the language model and distills the object-level information in…
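To make the auxiliary-sentence idea concrete, here is a minimal sketch of how a translated image caption might be paired with a tweet as a sentence-pair input. It is an illustration under stated assumptions, not the authors' implementation: the function names, the choice to append the target to the caption, the `[CLS]`/`[SEP]` packing, and the example tweet and caption are all hypothetical. The image-to-text step itself is assumed to have already produced `caption`.

```python
from typing import Tuple


def build_auxiliary_pair(tweet: str, target: str, caption: str) -> Tuple[str, str]:
    """Return (tweet_segment, auxiliary_segment) for a sentence-pair classifier.

    Assumption: the auxiliary sentence is the image caption followed by the
    target entity. The abstract does not specify the exact template.
    """
    if target not in tweet:
        raise ValueError("target must appear in the tweet text")
    auxiliary = f"{caption.strip()} {target.strip()}"
    return tweet.strip(), auxiliary


def to_pair_string(first: str, second: str) -> str:
    """Pack two segments in the common [CLS] A [SEP] B [SEP] layout (assumed)."""
    return f"[CLS] {first} [SEP] {second} [SEP]"


# Made-up example data, for illustration only.
tweet = "Great night watching Serena at the open"
pair = build_auxiliary_pair(tweet, "Serena", "a woman holding a tennis racket")
packed = to_pair_string(*pair)
print(packed)
```

Because the visual information arrives as ordinary text in the second segment, the language model can consume it without any architectural change, which is the property the summary above highlights.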
