TL;DR
This paper introduces an enhanced adversarial network that combines spatial and temporal features with attention mechanisms for improved facial affect estimation in videos, demonstrating superior results on benchmark datasets.
Contribution
It proposes a novel model with latent feature-based temporal modeling, adversarial training, and attention modules for more accurate spatio-temporal affect estimation.
Findings
Temporal modeling improves affect estimation accuracy.
Attention mechanisms significantly enhance performance.
A sequence length of around 160 ms is optimal for temporal features.
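To put the 160 ms figure in terms of video frames, here is a minimal arithmetic sketch. The frame rates used (25 and 30 fps) are assumptions for illustration, not values taken from the paper, and the helper name `frames_for_duration` is hypothetical.

```python
def frames_for_duration(duration_ms: float, fps: float) -> int:
    """Return the number of frames (rounded to nearest) spanning duration_ms at fps.

    Illustrative helper only; the frame rate is an assumption, not a value
    reported in the paper.
    """
    if duration_ms < 0 or fps <= 0:
        raise ValueError("duration_ms must be >= 0 and fps must be > 0")
    return round(duration_ms * fps / 1000.0)


# Assumed frame rates, for illustration only.
frames_at_25 = frames_for_duration(160, 25)
frames_at_30 = frames_for_duration(160, 30)
print(frames_at_25, frames_at_30)
```

Under these assumed frame rates, a 160 ms window covers only a handful of consecutive frames, so the temporal context in question is short.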
Abstract
Affective Computing has recently attracted the attention of the research community due to its numerous applications in diverse areas. In this context, the emergence of video-based data makes it possible to enrich the widely used spatial features with temporal information. However, such spatio-temporal modelling often results in very high-dimensional feature spaces and large volumes of data, making training difficult and time-consuming. This paper addresses these shortcomings by proposing a novel model that efficiently extracts both spatial and temporal features of the data through enhanced temporal modelling based on latent features. Our proposed model consists of three major networks, coined Generator, Discriminator, and Combiner, which are trained in an adversarial setting combined with curriculum learning to enable our adaptive attention modules. In our experiments, we…
