Sec2Sec Co-attention for Video-Based Apparent Affective Prediction
Mingwei Sun, Kunpeng Zhang

TL;DR
This paper introduces a novel LSTM-Transformer co-attention model for video-based affect prediction, improving accuracy and interpretability by integrating vision, audio, and spatiotemporal cues.
Contribution
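The paper presents the Sec2Sec Co-attention Transformer, an LSTM-based network augmented with a Transformer co-attention mechanism. It surpasses multiple state-of-the-art methods in apparent affect prediction and exposes how different time points contribute to the result.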
Findings
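Surpasses multiple state-of-the-art methods on the LIRIS-ACCEDE and First Impressions datasets
Provides interpretability by showing how different time points contribute to the predicted affect
Integrates vision, audio, audio-visual interactions, and spatiotemporal information in a single model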
Abstract
Video-based apparent affect detection plays a crucial role in video understanding, as it encompasses various elements such as vision, audio, audio-visual interactions, and spatiotemporal information, which are essential for accurate video predictions. However, existing approaches often focus on extracting only a subset of these elements, resulting in the limited predictive capacity of their models. To address this limitation, we propose a novel LSTM-based network augmented with a Transformer co-attention mechanism for predicting apparent affect in videos. We demonstrate that our proposed Sec2Sec Co-attention Transformer surpasses multiple state-of-the-art methods in predicting apparent affect on two widely used datasets: LIRIS-ACCEDE and First Impressions. Notably, our model offers interpretability, allowing us to examine the contributions of different time points to the overall…
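To make the co-attention idea concrete, here is a minimal sketch of cross-modal attention over per-second features. It is an illustration under stated assumptions, not the authors' implementation: the function names (`cross_attention`, `co_attention`), the toy feature values, and the choice of plain scaled dot-product attention without learned projections are all hypothetical, and the LSTM stage described in the abstract is omitted. Each visual time step attends over all audio time steps and vice versa, and the resulting weight matrices are the kind of quantity one could inspect to see which seconds contribute most.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query step attends over all key steps.

    Returns (contexts, weights), where weights[i][j] is how much query step i
    attends to key step j. No learned projections are used (an assumption
    made to keep the sketch short).
    """
    d = len(keys[0])
    contexts, weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        contexts.append(
            [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        )
    return contexts, weights


def co_attention(visual, audio):
    """Attend in both directions: visual over audio, and audio over visual."""
    vis_ctx, vis_to_aud = cross_attention(visual, audio, audio)
    aud_ctx, aud_to_vis = cross_attention(audio, visual, visual)
    return vis_ctx, aud_ctx, vis_to_aud, aud_to_vis


# Hypothetical toy features: one 2-dimensional vector per second of video.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
audio = [[0.5, 0.5], [2.0, 0.0], [0.0, 2.0]]

vis_ctx, aud_ctx, vis_to_aud, aud_to_vis = co_attention(visual, audio)

# Each row of vis_to_aud shows which audio seconds a visual second relies on.
for t, row in enumerate(vis_to_aud):
    print(t, [round(w, 3) for w in row])
```

In a full model the context vectors would typically be fed into further layers and a regression head. Here the point is only that the attention weights form a per-second map that can be read directly.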
Taxonomy
Topics: Emotion and Mood Recognition
