Multi-Granularity Network with Modal Attention for Dense Affective Understanding
Baoming Yan, Lin Wang, Ke Gao, Bo Gao, Xiao Liu, Chao Ban, Jiang Yang, Xiaobo Li

TL;DR
This paper introduces a multi-granularity network with modal attention for dense affective video understanding; it combines multi-level features with attention over modalities to improve frame-level emotion prediction.
Contribution
The paper presents a multi-granularity network with modal attention that fuses frame-level, clip-level, and video-level features and emphasizes affect-relevant modalities for improved dense affective understanding.
Findings
Achieved a correlation score of 0.02292 in the EEV challenge.
Effectively fuses multi-level features for better affective prediction.
Demonstrates the benefit of modal attention in emotion recognition.
Abstract
Video affective understanding, which aims to predict the expressions evoked by video content, is desirable for video creation and recommendation. The recent EEV challenge proposed a dense affective understanding task that requires frame-level affective prediction. In this paper, we propose a multi-granularity network with modal attention (MGN-MA), which employs multi-granularity features to better describe the target frame. Specifically, the multi-granularity features are divided into frame-level, clip-level, and video-level features, which correspond to visually salient content, semantic context, and video theme information, respectively. A modal attention fusion module then fuses the multi-granularity features and emphasizes the more affect-relevant modalities. Finally, the fused feature is fed into a Mixture of Experts (MoE) classifier to predict the expressions.…
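The pipeline described in the abstract (multi-granularity features, attention-weighted fusion, then a Mixture of Experts classifier) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the feature dimension, the number of classes and experts, the dot-product scoring used for attention, and the function names `modal_attention_fusion` and `moe_classifier` are all assumptions made for illustration, and random arrays stand in for real frame-, clip-, and video-level features.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def modal_attention_fusion(features, score_vectors):
    """Fuse per-granularity features with softmax attention weights.

    features: list of (batch, dim) arrays, one per granularity.
    score_vectors: list of (dim,) arrays (hypothetical learned parameters)
        that score how relevant each feature is for a given sample.
    """
    stacked = np.stack(features, axis=1)  # (batch, n_feat, dim)
    scores = np.stack(
        [f @ w for f, w in zip(features, score_vectors)], axis=1
    )  # (batch, n_feat)
    weights = softmax(scores, axis=1)  # attention over the features
    fused = (weights[..., None] * stacked).sum(axis=1)  # (batch, dim)
    return fused, weights


def moe_classifier(x, gate_w, expert_w):
    """Per-class mixture of logistic experts.

    gate_w, expert_w: (dim, n_classes, n_experts) arrays (assumed shapes).
    """
    gate = softmax(np.einsum("bd,dce->bce", x, gate_w), axis=-1)
    expert = sigmoid(np.einsum("bd,dce->bce", x, expert_w))
    return (gate * expert).sum(axis=-1)  # (batch, n_classes)


# Illustrative sizes only; none of these values come from the paper.
rng = np.random.default_rng(0)
batch, dim, n_classes, n_experts = 4, 8, 15, 2

frame_feat = rng.normal(size=(batch, dim))  # stand-in for frame-level features
clip_feat = rng.normal(size=(batch, dim))   # stand-in for clip-level features
video_feat = rng.normal(size=(batch, dim))  # stand-in for video-level features
score_vectors = [rng.normal(size=dim) for _ in range(3)]

fused, weights = modal_attention_fusion(
    [frame_feat, clip_feat, video_feat], score_vectors
)
probs = moe_classifier(
    fused,
    rng.normal(size=(dim, n_classes, n_experts)),
    rng.normal(size=(dim, n_classes, n_experts)),
)
```

In this sketch the attention weights sum to one across the three granularities for each sample, so the fused vector is a convex combination of the inputs, and each class score is a gate-weighted average of expert sigmoid outputs. A trained model would learn the scoring and expert parameters rather than sample them at random.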
Taxonomy
Topics: Video Analysis and Summarization · Music and Audio Processing · Human Pose and Action Recognition
