# Spatial-Temporal Recurrent Neural Network for Emotion Recognition

**Authors:** Tong Zhang (1, 2), Wenming Zheng (2), Zhen Cui (2), Yuan Zong (2), and Yang Li (1, 2) ((1) the Department of Information Science and Engineering, Southeast University, Nanjing, China; (2) the Key Laboratory of Child Development and Learning Science of Ministry of Education, Research Center for Learning Science, Southeast University, Nanjing, China)

arXiv: 1705.04515 · 2018-05-10

## TL;DR

This paper introduces STRNN, a deep learning framework that unifies spatial and temporal analysis of EEG and facial video signals for emotion recognition, and reports results on public EEG and facial expression datasets that are competitive with state-of-the-art methods.

## Contribution

The paper proposes a new spatial-temporal RNN model that captures dependencies across both spatial and temporal domains for emotion recognition from EEG and face signals.

## Key findings

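- STRNN is reported to compare favorably with state-of-the-art methods on public EEG and facial expression emotion datasets.
- A multi-directional spatial RNN layer followed by a bi-directional temporal RNN layer models both spatial and temporal dependencies of the input signals.
- Sparse projection applied to the spatial and temporal hidden states selects salient regions and increases the model's discriminative ability.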

## Abstract

Emotion analysis is a crucial problem to endow artifact machines with real intelligence in many large potential applications. As external appearances of human emotions, electroencephalogram (EEG) signals and video face signals are widely used to track and analyze human's affective information. According to their common characteristics of spatial-temporal volumes, in this paper we propose a novel deep learning framework named spatial-temporal recurrent neural network (STRNN) to unify the learning of two different signal sources into a spatial-temporal dependency model. In STRNN, to capture those spatially co-occurrent variations of human emotions, a multi-directional recurrent neural network (RNN) layer is employed to capture long-range contextual cues by traversing the spatial region of each time slice from multiple angles. Then a bi-directional temporal RNN layer is further used to learn discriminative temporal dependencies from the sequences concatenating spatial features of each time slice produced from the spatial RNN layer. To further select those salient regions of emotion representation, we impose sparse projection onto those hidden states of spatial and temporal domains, which actually also increases the model discriminant ability because of this global consideration. Consequently, such a two-layer RNN model builds spatial dependencies as well as temporal dependencies of the input signals. Experimental results on the public emotion datasets of EEG and facial expression demonstrate the proposed STRNN method is more competitive over those state-of-the-art methods.
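The two-layer structure described in the abstract can be illustrated with a small NumPy forward pass. This is only a rough sketch under stated assumptions, not the authors' implementation: it uses a plain tanh RNN cell, approximates "multiple angles" with four raster-scan orders over a 2D grid, sums the directional hidden states, and stands in for the sparse projection with dense matrices (`G_spatial`, `G_temporal`) plus an L1 penalty that a training loop could add to the loss. All function names, parameter names, and sizes are hypothetical.

```python
import numpy as np

def rnn_scan(xs, W_in, W_rec):
    """Plain tanh RNN over xs of shape (steps, in_dim); returns every hidden state."""
    h = np.zeros(W_rec.shape[0])
    states = []
    for x in xs:
        h = np.tanh(W_in @ x + W_rec @ h)
        states.append(h)
    return np.stack(states)

def init(rng, *shape):
    return rng.normal(scale=0.1, size=shape)

def make_params(rng, C, H, W, T, hs=8, ks=3, ht=6, kt=2, n_classes=3):
    # All sizes are illustrative assumptions, not values from the paper.
    return {
        "spatial": [(init(rng, hs, C), init(rng, hs, hs)) for _ in range(4)],
        "G_spatial": init(rng, ks, H * W),    # projection over spatial positions
        "temporal": [(init(rng, ht, ks * hs), init(rng, ht, ht)) for _ in range(2)],
        "G_temporal": init(rng, kt, T),       # projection over time steps
        "W_out": init(rng, n_classes, kt * ht),
    }

def strnn_forward(volume, params):
    """volume: (T, H, W, C) spatial-temporal input; returns class logits."""
    T, H, W, C = volume.shape
    idx = np.arange(H * W).reshape(H, W)
    # Assumed traversal orders: row-major, column-major, and their reverses.
    orders = [idx.ravel(), idx.ravel()[::-1], idx.T.ravel(), idx.T.ravel()[::-1]]

    slice_feats = []
    for t in range(T):
        flat = volume[t].reshape(H * W, C)
        summed = np.zeros((H * W, params["G_spatial"].shape[0] * 0 + params["spatial"][0][1].shape[0]))
        for order, (W_in, W_rec) in zip(orders, params["spatial"]):
            states = rnn_scan(flat[order], W_in, W_rec)
            restored = np.empty_like(states)
            restored[order] = states          # put states back at their grid positions
            summed += restored
        # Project hidden states across positions, then flatten per time slice.
        slice_feats.append((params["G_spatial"] @ summed).ravel())

    seq = np.stack(slice_feats)                                  # (T, ks * hs)
    fwd = rnn_scan(seq, *params["temporal"][0])                  # past -> future
    bwd = rnn_scan(seq[::-1], *params["temporal"][1])[::-1]      # future -> past
    pooled = (params["G_temporal"] @ (fwd + bwd)).ravel()        # (kt * ht,)
    return params["W_out"] @ pooled

def l1_penalty(params, lam=1e-3):
    """L1 term that would push the projection matrices toward sparsity during training."""
    return lam * (np.abs(params["G_spatial"]).sum() + np.abs(params["G_temporal"]).sum())

rng = np.random.default_rng(0)
volume = rng.normal(size=(5, 3, 4, 2))   # 5 time slices, 3x4 grid, 2 channels
params = make_params(rng, C=2, H=3, W=4, T=5)
logits = strnn_forward(volume, params)
print(logits.shape)  # (3,)
```

The sketch treats EEG and face inputs identically as a `(T, H, W, C)` volume, which mirrors the abstract's point that both sources share a spatial-temporal structure; how each signal is actually mapped onto a spatial layout is described in the full paper and is not reproduced here.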

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1705.04515/full.md

## Figures

6 figures with captions in the complete paper: https://tomesphere.com/paper/1705.04515/full.md

## References

28 references — full list in the complete paper: https://tomesphere.com/paper/1705.04515/full.md

---
Source: https://tomesphere.com/paper/1705.04515