TL;DR
This paper introduces the novel task of Pairwise Emotional Relationship Recognition in videos, presents a large multi-modal dataset called ERATO, and proposes a baseline model with a specialized attention mechanism to advance multi-modal video understanding.
Contribution
The paper defines the new Pairwise Emotional Relationship Recognition (PERR) task, builds ERATO, a large-scale multi-modal dataset for it, and develops the SMTA baseline model to improve multi-modal fusion in emotional relationship recognition.
Findings
ERATO contains 31,182 clips and 203 hours of video data.
SMTA improves multi-modal fusion performance by about 1%.
The dataset and model facilitate research in multi-modal emotion and relationship recognition.
Abstract
Recognizing the emotional state of people is a basic but challenging task in video understanding. In this paper, we propose a new task in this field, named Pairwise Emotional Relationship Recognition (PERR). This task aims to recognize the emotional relationship between the two interacting characters in a given video clip. It differs from the traditional emotion recognition and social relation recognition tasks. Various kinds of information, including character appearance, behaviors, facial emotions, dialogues, background music, and subtitles, contribute differently to the final results, which makes the task more challenging but also meaningful for developing more advanced multi-modal models. To facilitate the task, we develop a new dataset called Emotional RelAtionship of inTeractiOn (ERATO) based on dramas and movies. ERATO is a large-scale multi-modal dataset for the PERR task, which has 31,182…
