# Spatio-Temporal Facial Expression Recognition Using Convolutional Neural Networks and Conditional Random Fields

**Authors:** Behzad Hasani, Mohammad H. Mahoor

arXiv: 1703.06995 · 2020-04-17

## TL;DR

This paper cascades a deep convolutional network with a linear-chain Conditional Random Field (CRF) for facial expression recognition in videos. On CK+, MMI, and FERA, it outperforms state-of-the-art methods in cross-database experiments and yields comparable results in subject-independent experiments.

## Contribution

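The paper presents a two-part network for spatio-temporal facial expression recognition: a CNN with three Inception-ResNet modules and two fully-connected layers captures spatial relations within each frame, and a linear-chain CRF captures temporal relations between frames.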

## Key findings

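- Outperforms state-of-the-art methods in cross-database experiments.
- Yields results comparable to the state of the art in subject-independent experiments.
- Cascading the CNN with the CRF module considerably increases recognition accuracy for facial expressions in videos.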
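
To make the term "subject-independent" concrete, here is a minimal sketch of one way to build folds in which no subject appears in both the training and the test set. The function name `subject_independent_folds`, the round-robin assignment of subjects to folds, and the number of folds are all assumptions made for illustration; this summary does not state the protocol the authors used.

```python
from collections import defaultdict


def subject_independent_folds(subject_ids, n_folds):
    """Yield (train_indices, test_indices) with disjoint subjects.

    subject_ids: one subject identifier per video sequence (assumed sortable).
    n_folds: number of folds (hypothetical; not taken from the paper).
    """
    # Group sequence indices by the subject they belong to.
    by_subject = defaultdict(list)
    for idx, sid in enumerate(subject_ids):
        by_subject[sid].append(idx)

    subjects = sorted(by_subject)
    for k in range(n_folds):
        # Round-robin assignment: every n_folds-th subject goes to the test set.
        test_subjects = set(subjects[k::n_folds])
        test = [i for s in subjects if s in test_subjects for i in by_subject[s]]
        train = [i for s in subjects if s not in test_subjects for i in by_subject[s]]
        yield train, test
```

A cross-database experiment is the stricter variant of the same idea: the model is trained on one database and tested on a different one, so neither subjects nor recording conditions are shared.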

## Abstract

Automated Facial Expression Recognition (FER) has been a challenging task for decades. Many of the existing works use hand-crafted features such as LBP, HOG, LPQ, and Histogram of Optical Flow (HOF) combined with classifiers such as Support Vector Machines for expression recognition. These methods often require rigorous hyperparameter tuning to achieve good results. Recently, Deep Neural Networks (DNNs) have been shown to outperform traditional methods in visual object recognition. In this paper, we propose a two-part network consisting of a DNN-based architecture followed by a Conditional Random Field (CRF) module for facial expression recognition in videos. The first part captures the spatial relation within facial images using convolutional layers followed by three Inception-ResNet modules and two fully-connected layers. To capture the temporal relation between the image frames, we use a linear-chain CRF in the second part of our network. We evaluate our proposed network on three publicly available databases, viz. CK+, MMI, and FERA. Experiments are performed in subject-independent and cross-database manners. Our experimental results show that cascading the deep network architecture with the CRF module considerably increases the recognition of facial expressions in videos. In particular, it outperforms the state-of-the-art methods in the cross-database experiments and yields comparable results in the subject-independent experiments.
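
As a rough illustration of how a linear-chain CRF can use temporal context, the sketch below runs Viterbi decoding over per-frame scores. It is a generic textbook formulation, not the authors' implementation: the function name `viterbi_decode`, the score matrices, and the idea that `emissions` would come from the deep network's per-frame outputs are assumptions made for this example.

```python
import numpy as np


def viterbi_decode(emissions, transitions):
    """Return the highest-scoring label sequence and its score.

    emissions:   (T, K) array, score of each of K labels at each of T frames
                 (assumed here to come from a per-frame network).
    transitions: (K, K) array, score of moving from label i to label j.
    """
    T, K = emissions.shape
    score = emissions[0].copy()              # best score ending in each label
    backptr = np.zeros((T, K), dtype=int)    # best previous label per step

    for t in range(1, T):
        # candidate[i, j]: best path ending in i at t-1, then moving to j.
        candidate = score[:, None] + transitions
        backptr[t] = candidate.argmax(axis=0)
        score = candidate.max(axis=0) + emissions[t]

    # Trace the best path backwards from the best final label.
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):
        path.append(int(backptr[t, path[-1]]))
    return path[::-1], float(score.max())
```

With two labels, emissions `[[2, 0], [0, 1], [2, 0]]`, and transitions `[[1, -1], [-1, 1]]` that reward staying in the same label, a per-frame argmax would give `[0, 1, 0]`, whereas `viterbi_decode` returns `[0, 0, 0]` with score `6.0`. The transition term overrides the weak evidence in the middle frame.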

## Full text

_Full body text omitted from this summary view._ Fetch the complete paper as Markdown: https://tomesphere.com/paper/1703.06995/full.md

## Figures

2 figures with captions in the complete paper: https://tomesphere.com/paper/1703.06995/full.md

## References

43 references — full list in the complete paper: https://tomesphere.com/paper/1703.06995/full.md

---
Source: https://tomesphere.com/paper/1703.06995