TL;DR
This paper presents a multi-modal, context-aware deep learning framework for video-based emotion recognition in unconstrained environments, integrating scene attributes, motion, and skeleton data to improve accuracy.
Contribution
It introduces a novel multi-stream architecture combining scene context, motion, and skeleton-based features within a Temporal Segment Network for enhanced emotion recognition.
Findings
Outperforms existing methods on the BoLD dataset, surpassing previous state-of-the-art recognition scores
Effective integration of scene, motion, and skeleton data
Abstract
In this work we tackle the task of video-based visual emotion recognition in the wild. Standard methodologies that rely solely on the extraction of bodily and facial features often fall short of accurate emotion prediction in cases where the aforementioned sources of affective information are inaccessible due to head/body orientation, low resolution and poor illumination. We aspire to alleviate this problem by leveraging visual context in the form of scene characteristics and attributes, as part of a broader emotion recognition framework. Temporal Segment Networks (TSN) constitute the backbone of our proposed model. Apart from the RGB input modality, we make use of dense Optical Flow, following an intuitive multi-stream approach for a more effective encoding of motion. Furthermore, we shift our attention towards skeleton-based learning and leverage action-centric data as means of…
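To make the multi-stream idea concrete, the following is a minimal sketch of TSN-style segmental consensus followed by late fusion across streams. It is an illustration under stated assumptions, not the authors' implementation: the stream names, the number of segments and classes, the equal fusion weights, and the random scores standing in for real network outputs are all hypothetical. The abstract excerpt does not specify how the streams are actually combined.

```python
import numpy as np


def segment_consensus(segment_scores: np.ndarray) -> np.ndarray:
    """Average per-segment class scores into one video-level score vector.

    segment_scores has shape (num_segments, num_classes). Averaging is the
    simplest consensus function used with Temporal Segment Networks.
    """
    return segment_scores.mean(axis=0)


def fuse_streams(stream_scores: dict, weights: dict) -> np.ndarray:
    """Weighted late fusion of video-level scores from several streams."""
    total = sum(weights[name] for name in stream_scores)
    return sum(weights[name] * stream_scores[name] for name in stream_scores) / total


# Hypothetical setup: 3 segments per video and 26 emotion classes.
rng = np.random.default_rng(0)
num_segments, num_classes = 3, 26
streams = ["rgb", "flow", "scene", "skeleton"]  # assumed stream names

# Random scores stand in for the outputs of each stream's network.
per_stream = {
    name: segment_consensus(rng.normal(size=(num_segments, num_classes)))
    for name in streams
}

# Equal weights are an assumption; in practice they would be tuned or learned.
weights = {name: 1.0 for name in streams}
fused = fuse_streams(per_stream, weights)

print(fused.shape)
```

With equal weights, the fused vector is simply the mean of the per-stream video-level scores. Raising one stream's weight, for example the scene stream when the face and body are occluded, shifts the prediction toward that stream.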
