Shallow Triple Stream Three-dimensional CNN (STSTNet) for Micro-expression Recognition

Sze-Teng Liong; Y.S. Gan; John See; Huai-Qian Khor; Yen-Chang Huang

arXiv:1902.03634·cs.CV·August 22, 2019

TL;DR

This paper introduces a lightweight 3D CNN model called STSTNet for micro-expression recognition, effectively capturing high-level features from optical flow data with competitive accuracy.

Contribution

The paper presents a shallow, computationally efficient 3D CNN architecture that leverages optical flow features for improved micro-expression recognition.

Findings

01

Achieved an unweighted average recall of 0.7605

02

Obtained an unweighted F1-score of 0.7353

03

Demonstrated effectiveness on multiple micro-expression datasets

Abstract

In recent years, the state of the art in facial micro-expression recognition has been significantly advanced by deep neural networks. The robustness of deep learning has yielded promising performance beyond that of traditional handcrafted approaches. Most works in the literature have emphasized increasing the depth of networks and employing highly complex objective functions to learn more features. In this paper, we design a Shallow Triple Stream Three-dimensional CNN (STSTNet) that is computationally light whilst capable of extracting discriminative high-level features and details of micro-expressions. The network learns from three optical flow features (i.e., optical strain, horizontal and vertical optical flow fields) computed from the onset and apex frames of each video. Our experimental results demonstrate the effectiveness of the proposed STSTNet, which obtained an unweighted…

Tables

Table 1. TABLE I : STSTNet configuration for the convolutional ( C ) layers, pooling ( P ) layers, fully connected ( FC ) layer and output softmax layer

Layer	Filter size	# Filters	Stride	Padding	Output size
C1-1	3 $\times$ 3 $\times$ 3	3	[1,1]	1	28 $\times$ 28 $\times$ 3
C1-2	3 $\times$ 3 $\times$ 3	5	[1,1]	1	28 $\times$ 28 $\times$ 5
C1-3	3 $\times$ 3 $\times$ 3	8	[1,1]	1	28 $\times$ 28 $\times$ 8
P1-1	3 $\times$ 3	-	[3,3]	1	10 $\times$ 10 $\times$ 3
P1-2	3 $\times$ 3	-	[3,3]	1	10 $\times$ 10 $\times$ 5
P1-3	3 $\times$ 3	-	[3,3]	1	10 $\times$ 10 $\times$ 8
P2	2 $\times$ 2	-	[2,2]	0	5 $\times$ 5 $\times$ 16
FC	-	-	-	-	400 $\times$ 1
Softmax	-	-	-	-	3 $\times$ 1

Table 2. TABLE II : Detailed information of the three merged ME databases

Database		CASME II	SMIC	SAMM
Subjects		24	16	28
Samples		145	164	133
Frame rate (fps)		200	100	200
Cropped image resolution (pixels)		170 $\times$ 140	170 $\times$ 140	170 $\times$ 140
Frame number	Average	70	34	73
	Maximum	126	58	101
	Minimum	24	11	30
Video duration (s)	Average	0.35	0.34	0.36
	Maximum	0.63	0.58	0.51
	Minimum	0.12	0.11	0.15
Expression	Negative	88	70	92
	Positive	32	51	26
	Surprise	25	43	15
Ground-truth annotations	Onset index	✓	✓	✓
	Offset index	✓	✓	✓
	Apex index	✓	✗	✓

Table 3. TABLE III : Comparison of micro-expression recognition performance in terms of Accuracy ( Acc ), F1-score, Unweighted F1-score ( UF1 ) and Unweighted Average Recall ( UAR ) on the composite ( Full ), CASME II, SMIC and SAMM databases

No.	Methods	Full				SMIC		CASME II		SAMM
No.	Methods	Acc	F1-score	UF1	UAR	UF1	UAR	UF1	UAR	UF1	UAR
1	LBP-TOP baseline	-	-	0.5882	0.5785	0.2000	0.5280	0.7026	0.7429	0.3954	0.4102
2	Bi-WOOF [21]	0.6833	0.6304	0.6296	0.6227	0.5727	0.5829	0.7805	0.8026	0.5211	0.5139
3	AlexNet [12]	0.7308	0.6959	0.6933	0.7154	0.6201	0.6373	0.7994	0.8312	0.6104	0.6642
4	SqueezeNet [10]	0.6380	0.5964	0.5930	0.6166	0.5381	0.5603	0.6894	0.7278	0.5039	0.5362
5	GoogLeNet [28]	0.6335	0.5698	0.5573	0.6049	0.5123	0.5511	0.5989	0.6414	0.5124	0.5992
6	VGG16 [27]	0.6833	0.6439	0.6425	0.6516	0.5800	0.5964	0.8166	0.8202	0.4870	0.4793
7	OFF-ApexNet [7]	0.7460	0.7104	0.7196	0.7096	0.6817	0.6695	0.8764	0.8681	0.5409	0.5392
8	STSTNet	0.7692	0.7389	0.7353	0.7605	0.6801	0.7013	0.8382	0.8686	0.6588	0.6810

Table 4. TABLE IV : The confusion matrix of STSTNet on Full , SMIC, CASME II and SAMM databases (measured by recognition rate %)

Table 5. (a) Full

	Neg	Pos	Sur
Neg	87.60	8.80	3.60
Pos	36.70	56.88	6.42
Sur	25.30	3.61	71.08

Table 6. (b) SMIC

	Neg	Pos	Sur
Neg	77.14	14.29	8.57
Pos	33.33	58.82	7.84
Sur	32.56	2.33	65.12

Table 7. (c) CASME II

	Neg	Pos	Sur
Neg	94.32	5.68	0
Pos	37.50	59.38	3.13
Sur	8	0	92

Table 8. (d) SAMM

	Neg	Pos	Sur
Neg	89.13	7.61	3.26
Pos	42.31	50.00	7.69
Sur	33.33	13.33	53.33

Table 9. TABLE V : Key properties of the competing neural networks

Network	Depth	Parameter (Million)	Image Input Size	Execution Time (s)
STSTNet	2	0.00167	28 $\times$ 28 $\times$ 3	5.7366
OFF-ApexNet [7]	5	2.77	28 $\times$ 28 $\times$ 2	5.5632
AlexNet [12]	8	61	227 $\times$ 227 $\times$ 3	12.9007
SqueezeNet [10]	18	1.24	227 $\times$ 227 $\times$ 3	14.3704
GoogLeNet [28]	22	7	224 $\times$ 224 $\times$ 3	29.3022
VGG16 [27]	16	138	224 $\times$ 224 $\times$ 3	95.4436

Equations

d = \frac{\sum_{i=1}^{B} h_{1i} \times h_{2i}}{\sqrt{\sum_{i=1}^{B} h_{1i}^{2} \times \sum_{i=1}^{B} h_{2i}^{2}}}

s_{i} = \{f_{i,j} \mid i = 1, \dots, n;\ j = 1, \dots, F_{i}\},

O_{i} = \{(u(x,y), v(x,y)) \mid x = 1, 2, \dots, X,\ y = 1, 2, \dots, Y\},

\varepsilon = \frac{1}{2}[\nabla \mathbf{u} + (\nabla \mathbf{u})^{T}],

\varepsilon = \begin{bmatrix} \varepsilon_{xx} = \frac{\partial u}{\partial x} & \varepsilon_{xy} = \frac{1}{2}\left(\frac{\partial u}{\partial y} + \frac{\partial v}{\partial x}\right) \\ \varepsilon_{yx} = \frac{1}{2}\left(\frac{\partial v}{\partial x} + \frac{\partial u}{\partial y}\right) & \varepsilon_{yy} = \frac{\partial v}{\partial y} \end{bmatrix},

|\varepsilon_{x,y}| = \sqrt{\left(\frac{\partial u}{\partial x}\right)^{2} + \left(\frac{\partial v}{\partial y}\right)^{2} + \frac{1}{2}\left(\frac{\partial u}{\partial y} + \frac{\partial v}{\partial x}\right)^{2}}.

Accuracy := \frac{\sum_{\alpha=1}^{M} \sum_{\beta=1}^{k} TP_{\alpha}^{\beta}}{\sum_{\alpha=1}^{M} \sum_{\beta=1}^{k} TP_{\alpha}^{\beta} + \sum_{\alpha=1}^{M} \sum_{\beta=1}^{k} FP_{\alpha}^{\beta}}

F1-score := \frac{2 \times Precision \times Recall}{Precision + Recall}

Recall := \sum_{\alpha=1}^{M} \frac{\sum_{\beta=1}^{k} TP_{\alpha}^{\beta}}{M \times \left(\sum_{\beta=1}^{k} TP_{\alpha}^{\beta} + \sum_{\beta=1}^{k} FN_{\alpha}^{\beta}\right)}

Precision := \sum_{\alpha=1}^{M} \frac{\sum_{\beta=1}^{k} TP_{\alpha}^{\beta}}{M \times \left(\sum_{\beta=1}^{k} TP_{\alpha}^{\beta} + \sum_{\beta=1}^{k} FP_{\alpha}^{\beta}\right)}

UF1 := 2 \times \frac{\sum_{\alpha=1}^{M} \frac{Precision_{\alpha} \times Recall_{\alpha}}{Precision_{\alpha} + Recall_{\alpha}}}{M}

Precision_{\alpha} := \frac{\sum_{\beta=1}^{k} TP_{\alpha}^{\beta}}{\sum_{\beta=1}^{k} TP_{\alpha}^{\beta} + \sum_{\beta=1}^{k} FP_{\alpha}^{\beta}}

Recall_{\alpha} := \frac{\sum_{\beta=1}^{k} TP_{\alpha}^{\beta}}{\sum_{\beta=1}^{k} TP_{\alpha}^{\beta} + \sum_{\beta=1}^{k} FN_{\alpha}^{\beta}}

UAR := \frac{1}{M} \sum_{\alpha=1}^{M} \frac{\sum_{\beta=1}^{k} TP_{\alpha}^{\beta}}{\sum_{\beta=1}^{k} TP_{\alpha}^{\beta} + \sum_{\beta=1}^{k} FN_{\alpha}^{\beta}}

Code & Models

Repositories

christy1206/STSTNet
pytorch

Full text

Shallow Triple Stream Three-dimensional CNN (STSTNet) for Micro-expression Recognition

Sze-Teng Liong1, Y.S. Gan2, John See3, Huai-Qian Khor3, Yen-Chang Huang4

1 Department of Electronic Engineering, Feng Chia University, Taichung 40724, Taiwan R.O.C.

2 Department of Info. Management, National Taipei University of Nursing and Health Sciences, Taiwan R.O.C.

3 Faculty of Computing and Informatics, Multimedia University, 63100 Cyberjaya, Malaysia

4 School of Mathematics and Statistics, Xinyang Normal University, Henan, China

Abstract

In recent years, the state of the art in facial micro-expression recognition has been significantly advanced by deep neural networks. The robustness of deep learning has yielded promising performance beyond that of traditional handcrafted approaches. Most works in the literature have emphasized increasing the depth of networks and employing highly complex objective functions to learn more features. In this paper, we design a Shallow Triple Stream Three-dimensional CNN (STSTNet) that is computationally light whilst capable of extracting discriminative high-level features and details of micro-expressions. The network learns from three optical flow features (i.e., optical strain, horizontal and vertical optical flow fields) computed from the onset and apex frames of each video. Our experimental results demonstrate the effectiveness of the proposed STSTNet, which obtained an unweighted average recall of 0.7605 and an unweighted F1-score of 0.7353 on the composite database consisting of 442 samples from the SMIC, CASME II and SAMM databases.

I Introduction

Facial expressions are a form of nonverbal communication created by facial muscle contractions during emotional states. Different muscular movements and patterns reflect different types of emotions. However, the expressions portrayed on the face may not accurately reflect one’s emotional state, as they can be faked easily.

Among several types of nonverbal communication (i.e., facial expression, vocal intonation and body posture), the micro-expression (ME) has been found to be the likeliest to reveal one’s deepest emotions [4]. Since an ME is elicited involuntarily, it can expose one’s concealed genuine feelings without deliberate control. In contrast to facial macro-expressions, which normally last between 0.75 s and 2 s, micro-expressions are usually shorter (0.04 s to 0.2 s) and lower in intensity [5].

In recent years, there has been a growing interest in incorporating computer vision techniques in automated ME recognition systems. The state-of-the-art approaches for ME recognition (based on the original protocol) have obtained accuracy levels below 70% [1, 14], even though they were tested on datasets constructed in constrained laboratory environments. In contrast, normal (macro-) expression recognition systems can exhibit almost perfect recognition accuracy [11, 26]. Meanwhile, most ME videos are captured using high frame rate cameras (i.e., $>$ 100 fps), resulting in many redundant frames. Hence, it is essential to eliminate the overload of unnecessary facial information while highlighting important characteristics and cues of ME movements. The Temporal Interpolation Method (TIM) [35] is one of the techniques used in ME systems to address the problem of different video lengths [14, 30, 25]. It normalizes the length of all image sequences to a fixed length, either through downsampling or upsampling. TIM was adopted by the original ME databases to standardize the frame length before feature extraction. Moving along these lines, Liong et al. [21, 7] proposed to identify the ME category by using only information from a single apex frame (i.e., the frame with the highest emotion intensity). They demonstrated that it is sufficient to encode ME features by utilizing the apex frame and a neutral reference frame (typically the onset).

For feature extraction, numerous researchers have proposed algorithms based on the Local Binary Pattern (LBP) [23], such as Local Binary Pattern on Three-Orthogonal Planes (LBP-TOP) [34], Local Binary Pattern with Six Intersection Points (LBP-SIP) [31], Local Binary Pattern with Mean Orthogonal Planes (LBP-MOP) [32] and Spatiotemporal Completed Local Quantization Patterns (STCLQP) [9]. LBP is a texture-based feature extraction method with good discrimination ability, compact representation and low computational complexity. Meanwhile, some works favor optical flow features, which estimate frame-level motion from the change in brightness intensities between frames and are capable of capturing subtle facial movements. Optical flow-based approaches include Optical Strain Feature (OSF) [17], Optical Strain Weight (OSW) [18], Fuzzy Histogram of Oriented Optical Flow (FHOOF) [8], Main Directional Mean Optical flow (MDMO) [22] and, most recently, Bi-Weighted Oriented Optical Flow (Bi-WOOF) [21].

One of the earliest ME works to adopt a convolutional neural network (CNN) is that of Patel et al. [24]. However, their method fared worse than many conventional handcrafted descriptors, possibly due to model overfitting. On the other hand, a recent work by Li et al. [16] finetuned a VGG-Face model with ME apex frames and achieved up to $\sim$ 63% in accuracy, but at the expense of an enormous number of trainable parameters (i.e., 138 Million).

Likewise, Wang et al. [29] adopted a CNN and Long Short-Term Memory (LSTM) architecture to learn the spatial-temporal information of each image sequence, which also comes with a huge number of parameters (i.e., 80 Million). Prior to passing the image frames into the model, TIM is applied to each video sequence to standardize the frame length to either 32 or 64. Besides, a three-stream CNN was proposed by Li et al. [13], where each stream takes in a different type of data: grayscale, horizontal and vertical optical flow fields. Empirically, it performed as well as some recent methods ( $\sim$ 60%) [9, 21] on CASME II but was ineffective on SMIC, with a mere $\sim$ 55% accuracy.

To the best of our knowledge, [7] is the first work that performs cross-dataset validation on three distinct databases (i.e., CASME II [33], SMIC [15], SAMM [3]). Succinctly, they proposed a three-step framework:

1. Apex frame acquisition from each video;
2. Computation of optical flow guided features (i.e., horizontal and vertical optical flow images) from the apex and onset frames;
3. Feature learning and fusion using a neural network (coined ‘OFF-ApexNet’).

Hence, motivated by [7], this paper aims to improve the recognition performance by further simplifying the neural network while preserving sufficient capacity to learn the real structure of the ME details. The main contributions of this paper are as follows:

1. Proposal of a small and shallow 3-D convolutional neural network that remains effective in generating rich and discriminative feature representations.
2. Feature extraction from three types of optical flow information (i.e., optical strain, horizontal and vertical optical flow).
3. Re-implementation of several state-of-the-art methods and baseline CNN architectures, with quantitative experimental analyses.

II Proposed Method

While many architectures proposed in the literature rely on increasing the number of neurons or the number of layers to allow the network to learn more complex functions, this paper presents a shallow neural network architecture that comprises two learnable layers. Similar to [7], the proposed micro-expression recognition scheme consists of three main steps, namely: apex frame spotting, optical flow feature computation and feature learning with a CNN. The overview of the recognition approach is illustrated in Fig. 1.

II-A Apex frame spotting

Firstly, the apex frame spotting stage identifies the frame that contains the highest ME intensity in a video sequence. Since the SMIC database does not provide ground-truth apex frames, we employ the D&C-RoIs [20] approach to obtain the apex frame index. D&C-RoIs has been utilized by several recent ME works [21, 7, 19, 6] as it yields reasonably good ME recognition performance. With the LBP descriptor as the feature choice, the method first computes the correlation between the first frame and the rest of the frames:

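$$d=\frac{\sum_{i=1}^{B}h_{1i}\times h_{2i}}{\sqrt{\sum_{i=1}^{B}h_{1i}^{2}\times\sum_{i=1}^{B}h_{2i}^{2}}}$$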

where $B$ is the number of bins in histograms $h_{1}$ (first frame) and $h_{2}$ (each of the $N-1$ other frames). The rate of difference ( $1-d$ ) of the LBP features is then compared among the three ROIs, and the ROI with the highest rate of difference is selected. Finally, a divide-and-conquer strategy is applied to search for the frame with maximum facial muscle changes.
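The correlation and rate of difference described above can be sketched as follows. This is a minimal illustration, not the D&C-RoIs implementation: it assumes the usual normalized form of $d$ with a square root in the denominator, the function names and the 4-bin histograms are hypothetical, and a plain `argmax` stands in for the divide-and-conquer search.

```python
import numpy as np

def histogram_correlation(h1, h2):
    """Normalized correlation d between two histograms with B bins each."""
    h1 = np.asarray(h1, dtype=float)
    h2 = np.asarray(h2, dtype=float)
    denom = np.sqrt(np.sum(h1 ** 2) * np.sum(h2 ** 2))
    return float(np.sum(h1 * h2) / denom) if denom > 0 else 0.0

def rate_of_difference(first_hist, other_hists):
    """Return 1 - d for each later frame, compared against the first frame."""
    return np.array([1.0 - histogram_correlation(first_hist, h)
                     for h in other_hists])

# Hypothetical 4-bin LBP histograms: the first frame and three later frames.
first = [4, 3, 2, 1]
others = [[4, 3, 2, 1], [3, 3, 2, 2], [1, 2, 3, 4]]

diffs = rate_of_difference(first, others)
# Simplification: pick the frame with the largest 1 - d directly.
# The +1 converts the index within `others` to a frame index (frame 0 is `first`).
peak_frame = int(np.argmax(diffs)) + 1
```

An identical histogram gives a rate of difference of zero, and the value grows as the LBP texture departs from the first frame, which is why the maximum is used as a proxy for the strongest facial change.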

For clarity, we define some notation. Within a collection of video data, the $i$ -th ME video sequence consists of $F_{i}$ frames:

$$s_{i}=\{f_{i,j}\mid i=1,\dots,n;\ j=1,\dots,F_{i}\},$$

Each video contains only one onset (starting) frame $f_{i,1}$ , one offset (ending) frame $f_{i,F_{i}}$ and a single apex frame $f_{i,\alpha}\in[f_{i,1},f_{i,F_{i}}]$ . Note that the apex frame for SMIC is obtained via the D&C-RoIs spotting approach, while the other two databases already provide apex frame annotations.

II-B Optical flow guided features

Next, we compute optical flow guided features using the onset and apex frames. The optical flow field that is computed from these two frames can be formulated as a tuple:

$$O_{i}=\{(u(x,y),v(x,y))\mid x=1,2,\dots,X,\ y=1,2,\dots,Y\},$$

where $X$ and $Y$ denote the width and height of the frame $f_{i,j}$ , respectively, while $u(x,y)$ and $v(x,y)$ represent the horizontal and vertical components of $O_{i}$ , respectively. Another optical flow derivative, known as optical strain, is capable of approximating the intensity of facial deformation and it can be defined as:

$$\varepsilon=\frac{1}{2}[\nabla\mathbf{u}+(\nabla\mathbf{u})^{T}],$$

where $\mathbf{u}=[u,v]^{T}$ is the displacement vector. It can also be re-written as a Hessian matrix:

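$$\varepsilon=\begin{bmatrix}\varepsilon_{xx}=\frac{\partial u}{\partial x}&\varepsilon_{xy}=\frac{1}{2}\left(\frac{\partial u}{\partial y}+\frac{\partial v}{\partial x}\right)\\ \varepsilon_{yx}=\frac{1}{2}\left(\frac{\partial v}{\partial x}+\frac{\partial u}{\partial y}\right)&\varepsilon_{yy}=\frac{\partial v}{\partial y}\end{bmatrix},$$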

where the diagonal terms, ( $\varepsilon_{xx},\varepsilon_{yy}$ ), are normal strain components and ( $\varepsilon_{xy},\varepsilon_{yx}$ ) are shear strain components. The optical strain magnitude for each pixel can be computed by taking the sum of squares of the normal and shear strain components, such that:

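$$|\varepsilon_{x,y}|=\sqrt{\left(\frac{\partial u}{\partial x}\right)^{2}+\left(\frac{\partial v}{\partial y}\right)^{2}+\frac{1}{2}\left(\frac{\partial u}{\partial y}+\frac{\partial v}{\partial x}\right)^{2}}.$$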

Appending the optical strain to the optical flow field $O$ yields a triple, $\Theta=\{u,v,\varepsilon\}\in\mathbb{R}^{3}$ . In summary, each video yields the following three optical flow based representations:

• $u$ - Horizontal component of the optical flow field $O_{i}$ ,
• $v$ - Vertical component of the optical flow field $O_{i}$ ,
• $\varepsilon$ - Optical strain
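As a rough illustration of how the three representations could be assembled, the sketch below derives the strain magnitude from a given flow field with finite differences and stacks $u$, $v$ and $\varepsilon$ into one cube. It assumes the flow has already been estimated (the optical flow estimator itself is not shown), takes the magnitude as the square root of the summed squared strain components, and uses hypothetical function names and a synthetic 28 $\times$ 28 flow field.

```python
import numpy as np

def optical_strain_magnitude(u, v):
    """Per-pixel strain magnitude from horizontal (u) and vertical (v) flow.

    Arrays are indexed [y, x], so axis 0 is y and axis 1 is x.
    """
    du_dy, du_dx = np.gradient(u)
    dv_dy, dv_dx = np.gradient(v)
    e_xx = du_dx                      # normal strain along x
    e_yy = dv_dy                      # normal strain along y
    e_xy = 0.5 * (du_dy + dv_dx)      # shear strain (e_yx is identical)
    # e_xy**2 + e_yx**2 = 2 * e_xy**2 = 0.5 * (du_dy + dv_dx)**2
    return np.sqrt(e_xx ** 2 + e_yy ** 2 + 2.0 * e_xy ** 2)

def flow_cube(u, v):
    """Stack u, v and the strain magnitude into an H x W x 3 array."""
    return np.stack([u, v, optical_strain_magnitude(u, v)], axis=-1)

# Synthetic flow for illustration only: a pure stretch with constant gradients.
ys, xs = np.mgrid[0:28, 0:28].astype(float)
u = 0.1 * xs
v = 0.2 * ys
cube = flow_cube(u, v)
```

For this synthetic field the shear term vanishes and the strain is constant, which makes the result easy to check by hand; real flow fields would be computed at the cropped image resolution and resized afterwards.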

II-C Shallow triple stream 3D-CNN

The final step is to further learn the optical flow guided features using a new shallow triple stream 3D-CNN. By virtue of being a 3D-CNN, all convolutional layers are in three dimensions. The input to the network is the optical flow cube $\Theta$ described in the previous sub-section. Following the suggestion in [7], the 3D input cube is resampled to 28 $\times$ 28 $\times$ 3. Then, the image is passed through three parallel streams, each consisting of a convolutional layer (the streams have different numbers of kernels, i.e., 3, 5 and 8) followed by a max pooling layer. This design supplements the small-scale input data by using a different number of 3 $\times$ 3 kernels in each stream to avoid underfitting the data. In addition, the max pooling operation is used to highlight dominant features while eliminating redundancy. Next, the outputs are merged channel-wise to form a 3D block of features before applying an additional $2\times 2$ average pooling layer. Lastly, a 400-node fully connected (FC) layer provides further abstraction before the final softmax layer classifies the input into one of the three composite ME emotion classes.

The exact network configuration is shown in Table I. For all our experiments, we use a small learning rate of $5\times 10^{-5}$ with the maximum number of epochs set to 500.
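The output sizes listed in Table I can be checked with a few lines of arithmetic. The sketch below is only a shape calculation, not the network itself; it assumes the common floor-based output-size formula for convolution and pooling, and the helper name `out_size` is hypothetical.

```python
def out_size(n, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer (floor rounding)."""
    return (n + 2 * padding - kernel) // stride + 1

input_hw = 28
stream_filters = [3, 5, 8]   # number of kernels in each stream (Table I)

stream_outputs = []
for n_filters in stream_filters:
    conv_hw = out_size(input_hw, kernel=3, stride=1, padding=1)   # C1-x
    pool_hw = out_size(conv_hw, kernel=3, stride=3, padding=1)    # P1-x
    stream_outputs.append((pool_hw, pool_hw, n_filters))

# Channel-wise merge of the three streams, then the 2 x 2 pooling layer P2.
merged_channels = sum(channels for _, _, channels in stream_outputs)
merged_hw = stream_outputs[0][0]
p2_hw = out_size(merged_hw, kernel=2, stride=2, padding=0)
fc_inputs = p2_hw * p2_hw * merged_channels
```

Under these assumptions each stream keeps the 28 $\times$ 28 resolution after convolution and shrinks to 10 $\times$ 10 after max pooling; the merged block has 3 + 5 + 8 = 16 channels, and P2 reduces it to 5 $\times$ 5 $\times$ 16, i.e. the 400 values fed to the FC layer.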

III Experiment

III-A Databases

There are three databases commonly used for ME recognition: SMIC [15], CASME II [33] and SAMM [3, 2]. Detailed information on these three databases is shown in Table II. It is observed that the databases are individually small and have an imbalanced distribution of samples per emotion. In the second Facial Micro-expression Grand Challenge, the majority of video samples from these three databases are merged into a composite database by mapping their individual emotion classes into three generic emotion classes: ‘Positive’, ‘Negative’ and ‘Surprise’. There are a total of 442 samples after merging the databases.
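The composite totals follow directly from the per-database expression counts in Table II, as the short tally below shows. The dictionary layout and variable names are illustrative only.

```python
# Expression counts per database, copied from Table II.
counts = {
    "CASME II": {"Negative": 88, "Positive": 32, "Surprise": 25},
    "SMIC": {"Negative": 70, "Positive": 51, "Surprise": 43},
    "SAMM": {"Negative": 92, "Positive": 26, "Surprise": 15},
}

# Samples per database and per merged emotion class.
per_database = {db: sum(c.values()) for db, c in counts.items()}
per_class = {
    label: sum(c[label] for c in counts.values())
    for label in ("Negative", "Positive", "Surprise")
}
total = sum(per_database.values())
negative_share = per_class["Negative"] / total
```

The tally gives 250 Negative, 109 Positive and 83 Surprise samples (442 in total), so the Negative class alone accounts for more than half of the composite database.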

III-B Performance Metric

As the emotion classes are still imbalanced after merging (250 Negative, 109 Positive, 83 Surprise), two balanced metrics are used to reduce potential bias: Unweighted F1-score (UF1) and Unweighted Average Recall (UAR). Intuitively, both metrics are the averaged per-class computation of their respective original metrics:

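$$\text{Accuracy}:=\frac{\sum_{\alpha=1}^{M}\sum_{\beta=1}^{k}TP_{\alpha}^{\beta}}{\sum_{\alpha=1}^{M}\sum_{\beta=1}^{k}TP_{\alpha}^{\beta}+\sum_{\alpha=1}^{M}\sum_{\beta=1}^{k}FP_{\alpha}^{\beta}}$$

$$\text{F1-score}:=\frac{2\times\text{Precision}\times\text{Recall}}{\text{Precision}+\text{Recall}}$$

$$\text{Recall}:=\sum_{\alpha=1}^{M}\frac{\sum_{\beta=1}^{k}TP_{\alpha}^{\beta}}{M\times\left(\sum_{\beta=1}^{k}TP_{\alpha}^{\beta}+\sum_{\beta=1}^{k}FN_{\alpha}^{\beta}\right)}$$

$$\text{Precision}:=\sum_{\alpha=1}^{M}\frac{\sum_{\beta=1}^{k}TP_{\alpha}^{\beta}}{M\times\left(\sum_{\beta=1}^{k}TP_{\alpha}^{\beta}+\sum_{\beta=1}^{k}FP_{\alpha}^{\beta}\right)}$$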

where $M$ is the number of classes; TP, FN and FP are the true positive, false negative and false positive, respectively.

The new balanced metrics are expressed as follows:

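$$\text{UF1}:=2\times\frac{\sum_{\alpha=1}^{M}\frac{\text{Precision}_{\alpha}\times\text{Recall}_{\alpha}}{\text{Precision}_{\alpha}+\text{Recall}_{\alpha}}}{M}$$

$$\text{Precision}_{\alpha}:=\frac{\sum_{\beta=1}^{k}TP_{\alpha}^{\beta}}{\sum_{\beta=1}^{k}TP_{\alpha}^{\beta}+\sum_{\beta=1}^{k}FP_{\alpha}^{\beta}}$$

$$\text{Recall}_{\alpha}:=\frac{\sum_{\beta=1}^{k}TP_{\alpha}^{\beta}}{\sum_{\beta=1}^{k}TP_{\alpha}^{\beta}+\sum_{\beta=1}^{k}FN_{\alpha}^{\beta}}$$

$$\text{UAR}:=\frac{1}{M}\sum_{\alpha=1}^{M}\frac{\sum_{\beta=1}^{k}TP_{\alpha}^{\beta}}{\sum_{\beta=1}^{k}TP_{\alpha}^{\beta}+\sum_{\beta=1}^{k}FN_{\alpha}^{\beta}}$$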
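A small sketch may help make the two balanced metrics concrete. It assumes a single confusion matrix of pooled counts with true classes on the rows and predictions on the columns; the function name and the toy matrix are hypothetical and are not taken from the paper's results.

```python
import numpy as np

def uf1_uar(confusion):
    """Return (UF1, UAR) from a square confusion matrix of counts.

    Rows are true classes, columns are predicted classes. Every class is
    assumed to have at least one true sample and one predicted sample.
    """
    confusion = np.asarray(confusion, dtype=float)
    tp = np.diag(confusion)
    fn = confusion.sum(axis=1) - tp   # missed samples of each true class
    fp = confusion.sum(axis=0) - tp   # wrong predictions of each class
    recall = tp / (tp + fn)
    precision = tp / (tp + fp)
    f1 = 2 * precision * recall / (precision + recall)
    # Unweighted: every class contributes equally, whatever its size.
    return float(f1.mean()), float(recall.mean())

# Toy 3-class example (Negative, Positive, Surprise), made up for illustration.
toy = [[8, 1, 1],
       [2, 6, 2],
       [1, 1, 8]]
uf1, uar = uf1_uar(toy)
```

Because each class is averaged with equal weight, a classifier that favours the majority Negative class is penalised more heavily here than under plain accuracy.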

All the results presented are evaluated using the leave-one-subject-out (LOSO) cross-validation protocol.
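The LOSO protocol can be sketched as a generator that holds out all samples of one subject per fold. The function name and the subject labels below are hypothetical; only the splitting logic is shown, not the training or evaluation.

```python
def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for each subject."""
    for subject in sorted(set(subject_ids)):
        test_idx = [i for i, s in enumerate(subject_ids) if s == subject]
        train_idx = [i for i, s in enumerate(subject_ids) if s != subject]
        yield subject, train_idx, test_idx

# Hypothetical subject label for each of six video samples.
subjects = ["s01", "s01", "s02", "s03", "s03", "s03"]
folds = list(loso_splits(subjects))
```

Each fold trains on every other subject and tests on the held-out one, so no subject contributes samples to both sides of a split.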

IV Results and Discussion

We implemented a number of benchmark methods, i.e., the LBP-TOP [34] baseline, the state-of-the-art methods Bi-WOOF [21] and OFF-ApexNet [7], and some popular deep learning architectures (AlexNet [12], SqueezeNet [10], GoogLeNet [28], VGG16 [27]) in their original form, to compare against the proposed method. Table III reports our results. Full refers to the composite database. Table III shows that the proposed STSTNet outperforms the other methods in most scenarios; the exceptions are the UF1 scores on CASME II and SMIC, where OFF-ApexNet is higher (0.8764 vs. 0.8382 and 0.6817 vs. 0.6801, respectively). More importantly, it achieved the best UF1 (0.7353) and UAR (0.7605) on the full composite database. Overall, the STSTNet approach improved the UF1 by approximately 15, 48, 14 and 26 percentage points over the LBP-TOP baseline on the composite, SMIC, CASME II and SAMM databases, respectively.
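The improvement figures quoted above appear to correspond to the UF1 differences between STSTNet and the LBP-TOP baseline in Table III. The short check below recomputes them; the values are copied from Table III and the variable names are illustrative.

```python
# (LBP-TOP baseline UF1, STSTNet UF1) per database, copied from Table III.
uf1_scores = {
    "Full": (0.5882, 0.7353),
    "SMIC": (0.2000, 0.6801),
    "CASME II": (0.7026, 0.8382),
    "SAMM": (0.3954, 0.6588),
}

# Absolute UF1 gain in percentage points, rounded to two decimals.
gain_points = {
    name: round(100 * (ststnet - baseline), 2)
    for name, (baseline, ststnet) in uf1_scores.items()
}
```

The gains come out at roughly 14.7, 48.0, 13.6 and 26.3 points, matching the rounded values in the text; they are absolute differences in UF1 rather than relative improvements.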

The confusion matrix in Table IV shows that the proposed method is capable of distinguishing the negative emotion, which is partly attributed to the fact that more than half of all videos belong to the Negative class. In SAMM, the STSTNet also performed very well on the Negative class ( $\sim$ 90%); the class imbalance problem is even more severe in the case of SAMM. As expected, the high frame rate capture of the CASME II samples provides more precise apex frames, which in turn leads to more accurate optical flow computation that better characterizes the motion changes. In contrast, STSTNet exhibits lower recognition performance on SMIC than on CASME II for two possible reasons. First, the additional spotting step (which has a reported MAE of $\sim$ 13 frames [20]) may introduce errors through inaccurately spotted apex frames. Second, SMIC videos are captured at a lower frame rate (100 fps) and are affected by background noise such as shadows, highlights, illumination changes and flickering lights arising from the database elicitation setup.

Table V summarizes some key properties of all competing neural networks mentioned in Table III, including: 1) Depth - the largest number of sequential convolutional or fully connected layers in an end-to-end neural network; 2) Learnable parameters - the number of weights and biases in the network; 3) Image input size - input image resolution; 4) Fold execution time - the total training and testing time for a single fold of LOSO cross-validation. The STSTNet has the smallest network depth (2), the fewest learnable parameters (1670 weights and biases) and a relatively low computational time ( $\sim$ 5.7 s to train the model and infer the test data). Our network models are implemented in MATLAB, and the code is publicly available at https://github.com/christy1206/STSTNet for non-commercial or research use.

V Conclusion

As a submission entry to the 2nd Micro-Expression Grand Challenge (MEGC), this paper presents a novel shallow triple stream three-dimensional CNN (STSTNet) to learn optical flow guided features for ME recognition. A compact and discriminative feature representation is learned from an input cube consisting of three optical flow images (i.e., horizontal optical flow, vertical optical flow and optical strain). Overall, the proposed STSTNet approach has demonstrated promising recognition results on a newly merged composite ME database consisting of three spontaneous ME databases, yielding a UF1 of 0.7353 and a UAR of 0.7605, which surpass recent state-of-the-art methods. In the future, the apex spotting technique requires further improvement to extract more accurate apex frames for ME recognition. A number of recent works have begun to exploit the benefits of using the apex frame for recognition [21]. Furthermore, other measures of the optical flow field, such as the magnitude and orientation, can also be considered as input to the neural network.

Acknowledgements

This work was funded in part by Ministry of Science and Technology (MOST) (Grant Number: MOST107-2218-E-035-006-), MOHE Grant FRGS/1/2016/ICT02/MMU/02/2 Malaysia and Shanghai ’The Belt and Road’ Young Scholar Exchange Grant (17510740100).

Bibliography

[1] B. Allaert, I. M. Bilasco, and C. Djeraba. Consistent optical flow maps for full and micro facial expression recognition. In VISIGRAPP (5: VISAPP), pages 235–242, 2017.
[2] A. Davison, W. Merghani, and M. Yap. Objective classes for micro-facial expression recognition. Journal of Imaging, 4(10):119, 2018.
[3] A. K. Davison, C. Lansley, N. Costen, K. Tan, and M. H. Yap. SAMM: A spontaneous micro-facial movement dataset. IEEE Trans. on Aff. Computing, 2016.
[4] P. Ekman. Telling lies: Clues to deceit in the marketplace, politics, and marriage (revised edition). WW Norton & Company, 2009.
[5] P. Ekman and W. V. Friesen. Constants across cultures in the face and emotion. Journal of Personality and Social Psychology, 17(2):124, 1971.
[6] Y. Gan and S.-T. Liong. Bi-directional vectors from apex in cnn for micro-expression recognition. In 2018 IEEE 3rd ICIVC, pages 168–172, 2018.
[7] Y. Gan, S.-T. Liong, W.-C. Yau, Y.-C. Huang, and L.-K. Tan. Off-apexnet on micro-expression recognition system. Signal Proc. Image Comm., 2019.
[8] S. L. Happy and A. Routray. Fuzzy histogram of optical flow orientations for micro-expression recognition. IEEE Trans. on Aff. Computing, 99, 2017.