Spotting Micro-Expressions on Long Videos Sequences
Jingting Li, Catherine Soladié, Renaud Séguier, Su-Jing Wang, and Moi Hoon Yap

TL;DR
This paper introduces baseline results for micro-expression spotting in long videos using local temporal patterns and PCA, evaluated on SAMM and CAS(ME)2 datasets, with a focus on defining true positives and F1-score metrics.
Contribution
It presents a novel baseline approach for micro-expression spotting using local temporal patterns and PCA, setting evaluation criteria and metrics for future research.
Findings
Baseline F1-score of 0.0316 on SAMM
Baseline F1-score of 0.0179 on CAS(ME)2
Evaluation framework with true positive criteria and F1-score metric
Abstract
This paper presents baseline results for the first Micro-Expression Spotting Challenge 2019 by evaluating local temporal pattern (LTP) on SAMM and CAS(ME)2. The proposed LTP patterns are extracted by applying PCA in a temporal window on several local facial regions. The micro-expression sequences are then spotted by a local classification of LTP and a global fusion. The performance is evaluated by Leave-One-Subject-Out cross validation. Furthermore, we define the criterion for determining true positives in one video by overlap rate and use the F1-score as the metric for spotting performance on the whole database. The F1-scores of the baseline results for SAMM and CAS(ME)2 are 0.0316 and 0.0179, respectively.

Table I: Comparison of the SAMM and CAS(ME)2 databases.

| Database | Participants | Samples | Resolution | FPS |
|---|---|---|---|---|
| SAMM | 32 | 79 | 2040×1088 | 200 |
| CAS(ME)2 | 22 | 97 | 640×480 | 30 |

Table II: Spotting results of the LTP-ML and LBP-χ² methods on the whole databases.

| Method | LTP-ML | LTP-ML | LTP-ML | LTP-ML | LBP-χ² | LBP-χ² | LBP-χ² |
|---|---|---|---|---|---|---|---|
| Database | SAMM (cropped) | SAMM (full frame) | CAS(ME)2 (32 ME videos) | CAS(ME)2 (all 97 videos) | SAMM | CAS(ME)2 (32 ME videos) | CAS(ME)2 (all 97 videos) |
| nb_vid | 79 | 79 | 32 | 97 | 79 | 32 | 97 |
| TP | 34 | 47 | 16 | 16 | 12 | 10 | 10 |
| FP | 1958 | 3891 | 1711 | 5742 | 4172 | 1729 | 5435 |
| FN | 125 | 112 | 41 | 41 | 147 | 47 | 47 |
| Precision | 0.0171 | 0.0043 | 0.0093 | 0.0028 | 0.0028 | 0.0057 | 0.0018 |
| Recall | 0.2138 | 0.2956 | 0.2807 | 0.2807 | 0.0755 | 0.1754 | 0.1754 |
| F1-score | 0.0316 | 0.0229 | 0.0179 | 0.0055 | 0.0055 | 0.0111 | 0.0035 |
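
Table III: Experimental parameters for the two databases.

| Database | Sliding window length (frames) | Overlap (frames) | Interval length (frames) | ROI size |
|---|---|---|---|---|
| SAMM | 200 | 60 | 60 | 15 |
| CAS(ME)2 | 30 | 9 | 9 | 10 |
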
Spotting Micro-Expressions on Long Videos Sequences
Jingting Li¹, Catherine Soladié¹, Renaud Séguier¹, Su-Jing Wang² and Moi Hoon Yap³
¹ CentraleSupélec, CNRS, IETR, UMR 6164, F-35000 Rennes, France
² Key Laboratory of Behavior Sciences, Institute of Psychology, Chinese Academy of Sciences, Beijing, 100101, China
³ Manchester Metropolitan University, Manchester, M1 5GD, UK
This work is supported by the China Scholarship Council and ANR Reflet. This paper is also supported in part by grants from the National Natural Science Foundation of China (61772511) and The Royal Society (IF160006).
Abstract
This paper presents two methods for the first Micro-Expression Spotting Challenge 2019 by evaluating local temporal pattern (LTP) and local binary pattern (LBP) on the two most recent databases, i.e. SAMM and CAS(ME)2. First, we propose the LTP-ML method as the baseline for the challenge, and then we compare its results with the LBP-χ²-distance method. The LTP patterns are extracted by applying PCA in a temporal window on several local facial regions. The micro-expression sequences are then spotted by a local classification of LTP and a global fusion. The LBP-χ²-distance method compares feature differences by calculating the χ² distance of LBP features in a time window; facial movements are then detected with a threshold. The performance is evaluated by Leave-One-Subject-Out cross validation. Overlapping frames are used to determine the true positives, and the F1-score is used to compare the spotting performance on the databases. The F1-scores of the LTP-ML results for SAMM and CAS(ME)2 are 0.0316 and 0.0179, respectively. The results show that the proposed LTP-ML method outperforms the LBP-χ²-distance method in terms of F1-score on both databases.
978-1-7281-0089-0/19/$31.00 ©2019 IEEE
I INTRODUCTION
A facial micro-expression (ME) is a brief, local facial movement that can be triggered under high emotional pressure. Its duration is less than 500 ms [1]. It is a very important non-verbal communication cue: its involuntary nature makes it possible to analyze a person's genuine emotional state. ME analysis has many potential applications in national security [2], medical care [3], educational psychology [4], and political psychology [5]. Given the growing interest in and importance of MEs, researchers [6] have worked collaboratively to solicit work in this area by organizing challenges on datasets and methods for MEs. This year, the theme of the Second Facial Micro-Expression Grand Challenge has been extended to include a spotting challenge.
The main idea of most ME spotting methods is to compare the feature differences between the first frame and the other frames in a time window. The feature descriptors used in the state of the art are diverse, to name a few: LBP [7, 8], HOG [9], optical flow [10, 11, 12, 13], integral projection [14], Riesz pyramid [15], and frequency domain [16]. Feature differences allow consistent comparisons between frames over a time window of the size of an ME. However, the movements spotted between frames might not be ME movements; they could be noise, macro-movements or illumination changes. This is why the ability to distinguish MEs from other movements (such as blinking or subtle head movements) remains an open challenge.
Methods utilizing machine learning are now emerging [17, 18, 19, 20]. Furthermore, [21] employed deep learning for the first time to perform ME spotting. The machine learning process enhances the ability to distinguish micro-expressions from other movements. However, spatial patterns are still the primary feature for the classifier. The temporal variation pattern of facial movement over an ME duration has yet to attract sufficient attention. Meanwhile, few articles spot micro-expressions directly from local regions, even though the fact that a micro-expression is a local facial movement could help to reduce false positives.
In this paper, we spot the micro-expression clips in two recently published databases and establish the baseline method for the ME spotting challenge by directly using a temporal pattern extracted from local regions [22]. Frames in an ME duration are taken into account to obtain a genuinely local and temporal pattern (LTP), and the LTPs are then recognized by a classifier. Even though the spatial pattern is not studied, the spotted facial motions are differentiated by a fusion process from local to global. This method helps to improve the ability to distinguish MEs from other movements. Furthermore, it allows finding the spatial local region of the ME and its temporal onset index. We compare the results of our proposed LTP-ML method with an LBP approach, LBP-χ²-distance, by Moilanen et al. [7].
The rest of the paper is organized as follows: Section II presents the methodology and performance metrics. Section III presents and discusses the experimental results. Section IV concludes the paper.
II Methodology
This section describes the benchmark databases, the proposed LTP-ML method, the state-of-the-art LBP method and the performance metrics.
II-A Databases
The two most recent spontaneous micro-expression databases with long videos, SAMM [23] and CAS(ME)2 [24], are used for the ME spotting challenge. Both databases contain long videos, which were recorded in a strictly controlled laboratory environment. Table I compares these two databases. The notable differences are the resolution and frame rate used in the experimental settings. These differences make it a great challenge for the computer vision and machine learning community to produce a robust method that works for both databases. Detailed information on the two databases is presented in the following two subsections.
II-A1 SAMM Long Videos Database
The SAMM database consists of a total of 32 subjects, each with 7 videos [23]. The average length of the videos is 35.3 s. The original release of SAMM consists of micro-movement clips labelled in Action Units. Recently, the authors [25] introduced objective classes and emotion classes for the database. The recognition challenge will use the emotion classes from the database as ground truth. The spotting challenge focuses on 79 videos, each containing one or more micro-movements, for a total of 159 micro-movements. The indices of the onset, apex and offset frames of the micro-movements are provided as the ground truth. A micro-movement interval runs from the onset frame to the offset frame. In this database, all the micro-movements are labeled. Thus, the spotted frames can indicate not only MEs but also other facial movements, such as eye blinks.
II-A2 CAS(ME)2 Database
Part A of the CAS(ME)2 database [24] contains 22 subjects and 97 long videos. The average duration is 148 s. The facial movements are classified as macro- and micro-expressions. The video samples may contain multiple macro or micro facial expressions. The onset, apex and offset indices of these expressions are given in an Excel file. In addition, the eye blinks are labeled with onset and offset times.
II-B LTP-ML: Our Proposed Baseline Method
The baseline method is based on the LTP-ML (local temporal pattern-machine learning) method proposed in [22]. The method is extended to long videos by employing a sliding temporal window. The main idea and the modifications of the LTP-ML method are presented in the following paragraphs.
II-B1 Pre-processing
As the ME is a local facial movement, we analyze MEs only on a selection of regions of interest (ROIs). First, as shown in Figure 1, 84 facial landmarks are tracked in the video sequence using Genfacetracker (©Dynamixyz). The size of each ROI square is then determined by the distance between the left and right inner corners of the eyes. Twelve ROI squares are chosen based on the regions where MEs occur most frequently, i.e. the corners of the eyebrows and of the mouth. Two ROIs in the nose region are chosen as references because the nose is the most rigid facial region.
Since the average duration of an ME is around 300 ms and the subjects barely move within one second, the long videos in these two databases are processed by a temporal sliding window whose length is 1 s. The overlap is set to 300 ms to avoid missing any possible ME movement. Thus, the video is separated into a set of short sequences by the sliding temporal window, as shown in Figure 2. The positions of the 12 chosen ROIs for all frames in one sequence are determined by the landmarks detected in the first frame of the window.
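The windowing step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `sliding_windows` and the choice to keep a shorter final window (so that trailing frames are not dropped) are assumptions; only the 1 s window and the 300 ms overlap come from the text.

```python
def sliding_windows(n_frames, fps, window_s=1.0, overlap_s=0.3):
    """Return (start, end) frame index pairs, with end exclusive.

    The window length and overlap are given in seconds and converted to
    frames with the video frame rate.
    """
    if n_frames <= 0:
        return []
    win = int(round(window_s * fps))
    overlap = int(round(overlap_s * fps))
    step = win - overlap
    if win <= 0 or step <= 0:
        raise ValueError("window must be longer than the overlap")

    windows = []
    start = 0
    while start + win <= n_frames:
        windows.append((start, start + win))
        start += step

    # Assumption: keep a shorter last window so the video tail is covered.
    if not windows or windows[-1][1] < n_frames:
        windows.append((start, n_frames))
    return windows


# A 500-frame clip at 200 fps: 200-frame windows that overlap by 60 frames.
print(sliding_windows(500, 200))
```

At 200 fps this gives 200-frame windows with a 60-frame overlap, and at 30 fps it gives 30-frame windows with a 9-frame overlap.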
II-B2 Feature Extraction
In this part, local temporal patterns (LTPs) [22] are analyzed in local regions to distinguish MEs from other movements. They are extracted from each of the 12 ROIs in each short sequence. For a sequence $I_m$, as illustrated in the lower part of Figure 2, PCA is performed on the temporal axis of each ROI sequence to conserve the principal variation in this region. The first two components of each ROI frame are used to analyze the variation pattern of the local movement. The PCA process for ROI sequence $j$ ($j \le 12$) in $I_m$ can be presented as in equation 1.

$$\left[P^j_1, \dots, P^j_N\right] \xrightarrow{\text{PCA}} \begin{bmatrix} x_1 & \dots & x_N \\ y_1 & \dots & y_N \end{bmatrix} \tag{1}$$

where $P^j_n$ represents the pixels in one ROI frame, $(x_n, y_n)$ are the first two components of PCA, and $n$ ($n \le N$) is the frame index in this ROI sequence. Hence, each frame in the ROI sequence can be represented by a point $(x_n, y_n)$.
Then, a sliding window is set depending on the average duration of an ME (300 ms). The distances between the first frame and the other frames in this window are calculated. The window goes through each frame in the sequence $I_m$, which yields a set of distances for every frame, as shown in Figure 3.
The distance values are then normalized over the entire sequence to avoid the influence of different movement magnitudes in different videos. Hence, the feature of frame $n$ for one ROI sequence consists of the normalized distance values together with the normalization coefficient. A more detailed derivation can be found in [22]. The feature for one ROI sequence of the entire long video is the concatenation of the features of all the separated sequences.
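The feature extraction described above can be illustrated with a short NumPy sketch. It is a sketch under stated assumptions, not the implementation from [22]: the function names are hypothetical, PCA is computed through an SVD of the mean-centred pixel matrix, and the normalization simply divides by the largest distance in the sequence, whereas the exact normalization is the one derived in [22].

```python
import numpy as np


def roi_pca_points(roi_frames):
    """Project each ROI frame onto the first two principal components.

    roi_frames: array of shape (N, h, w), the grayscale ROI crops of one
    short sequence. Returns an (N, 2) array of points (x_n, y_n).
    """
    n = roi_frames.shape[0]
    X = roi_frames.reshape(n, -1).astype(float)
    X -= X.mean(axis=0)  # centre every pixel over the temporal axis
    U, S, _ = np.linalg.svd(X, full_matrices=False)
    return U[:, :2] * S[:2]


def ltp_distances(points, interval):
    """Distances between the first frame of a window and the frames after it.

    Row i holds the distances from frame i to frames i+1 .. i+interval.
    """
    rows = []
    for start in range(len(points) - interval):
        seg = points[start:start + interval + 1]
        rows.append(np.linalg.norm(seg[1:] - seg[0], axis=1))
    return np.array(rows)


def normalise(dist):
    """Assumed normalization: divide by the largest distance in the sequence."""
    coef = float(dist.max()) if dist.size else 0.0
    if coef == 0.0:
        return dist, coef
    return dist / coef, coef


# Synthetic 30-frame sequence of 10x10 ROI crops, with a 9-frame interval.
rng = np.random.default_rng(0)
frames = rng.random((30, 10, 10))
points = roi_pca_points(frames)
features, coef = normalise(ltp_distances(points, interval=9))
print(points.shape, features.shape)
```

The normalized distances and the returned coefficient together play the role of the per-frame feature described in the text.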
II-B3 Local Classification
As presented above, one video yields 12 feature sets from the 12 ROIs. Li et al. [22] showed that the LTP patterns are similar across all chosen ROIs and all kinds of ME. The patterns that represent local ME movements can therefore be recognized by a local classification. A supervised SVM classifier is employed with Leave-One-Subject-Out cross validation. The feature selection and label annotation are presented in [22].
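The Leave-One-Subject-Out protocol can be sketched in plain Python. The helper `leave_one_subject_out` and the subject identifiers below are hypothetical. In each fold, all samples of one subject are held out for testing, and a classifier (an SVM in this paper) would be trained on the remaining samples.

```python
def leave_one_subject_out(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices) for every subject.

    subject_ids[i] is the subject that sample i belongs to.
    """
    for held_out in sorted(set(subject_ids)):
        train = [i for i, s in enumerate(subject_ids) if s != held_out]
        test = [i for i, s in enumerate(subject_ids) if s == held_out]
        yield held_out, train, test


# Five samples recorded from three (hypothetical) subjects.
subjects = ["s01", "s01", "s02", "s03", "s03"]
for held_out, train, test in leave_one_subject_out(subjects):
    print(held_out, train, test)
```

Splitting by subject rather than by sample keeps every frame of a given person out of the training set when that person is being tested.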
II-B4 Global Fusion
After the LTPs that fit the local ME movement pattern are recognized, a global fusion is performed to eliminate the false positives caused by other movements and the false negatives caused by our recognition process. As introduced in [22], there are three steps: local qualification, spatial fusion and a merge process.
II-C LBP-χ²-distance Method
This method was first proposed in [7]. It is the most commonly used method for comparing ME spotting results. Based on [7] and [18], the configuration of LBP-χ² is set as follows: the entire face region is divided into 36 blocks. The overlap rates between blocks along the X and Y axes are 0.2 and 0.3, respectively. LBP features are extracted from the blocks with uniform mapping. The radius r and the number of neighboring points p are fixed as part of this configuration. The χ² distance for each frame is computed over a fixed interval.
First, the LBP-χ²-distance values are compared over the whole long video. However, the method can then barely spot any micro-expression intervals, while producing many false positives. This is because the method spots the maximal movements in the video, and both databases contain movements larger than MEs. Hence, the entire video is separated into a set of sub-videos by a sliding window, with the same settings as in the LTP-ML method. For each sub-video, the feature differences are calculated and sorted to find the maximal movement in this short interval. This gives the chance to spot more MEs that would be missed when comparing over the entire video.
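The two ingredients of this comparison method, a χ² distance between histograms and a per-sub-video search for the maximal movement, can be sketched as below. The function names are hypothetical, and the plain comparison of the maximal value against a fixed threshold is a simplifying assumption; the exact feature-difference and peak-selection rules are those of [7].

```python
def chi_square_distance(p, q):
    """Chi-squared distance between two histograms of equal length."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total = 0.0
    for a, b in zip(p, q):
        if a + b > 0:  # empty bins contribute nothing
            total += (a - b) ** 2 / (a + b)
    return total


def spot_peaks(distances, windows, threshold):
    """Keep, for every sub-video, the frame with the maximal distance.

    distances: one feature-difference value per frame.
    windows:   (start, end) frame ranges, end exclusive.
    Assumption: a peak is kept only if it exceeds a fixed threshold.
    """
    peaks = set()
    for start, end in windows:
        segment = distances[start:end]
        if len(segment) == 0:
            continue
        best = max(range(len(segment)), key=lambda i: segment[i])
        if segment[best] > threshold:
            peaks.add(start + best)
    return sorted(peaks)


# Two toy sub-videos of three frames each.
values = [0.0, 0.2, 0.1, 0.05, 0.9, 0.3]
print(spot_peaks(values, [(0, 3), (3, 6)], threshold=0.15))
```

Searching inside each sub-video lets a small local maximum be reported even when a much larger movement occurs elsewhere in the long video.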
II-D Performance Metrics
There are three evaluation methods used to compare the performance of the spotting tasks:
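1. True positive in one video definition. Supposing there are $m$ micro-expressions in the video, and $n$ intervals are spotted. A spotted interval $W_{\text{spotted}}$ is considered a true positive (TP) if it fits the following condition:

$$\frac{W_{\text{spotted}} \cap W_{\text{groundTruth}}}{W_{\text{spotted}} \cup W_{\text{groundTruth}}} \ge k$$

where $k$ is set to 0.5 and $W_{\text{groundTruth}}$ represents the micro-expression interval (onset-offset). Otherwise, the spotted interval is regarded as a false positive (FP).
2. Result evaluation in one video. Supposing the number of TPs in one video is $a$ ($a \le m$ and $a \le n$), then FP = $n - a$ and false negative (FN) = $m - a$. The Recall, Precision and F1-score are defined as:

$$\text{Recall} = \frac{a}{m}, \qquad \text{Precision} = \frac{a}{n}$$

$$\text{F1-score} = \frac{2\,TP}{2\,TP + FP + FN} = \frac{2a}{m + n}$$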
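The overlap criterion and the per-video counts can be sketched in Python as follows. This is an illustration only: intervals are taken as inclusive (onset, offset) frame indices, and matching each ground-truth interval to at most one spotted interval is an assumption, as are the function names.

```python
def overlap_ratio(spotted, truth):
    """Intersection over union of two inclusive (onset, offset) intervals."""
    s_on, s_off = spotted
    t_on, t_off = truth
    inter = max(0, min(s_off, t_off) - max(s_on, t_on) + 1)
    union = (s_off - s_on + 1) + (t_off - t_on + 1) - inter
    return inter / union


def count_video(spotted_intervals, truth_intervals, k=0.5):
    """Return (TP, FP, FN) for one video.

    Assumption: each ground-truth interval can be matched only once.
    """
    matched = set()
    tp = 0
    for s in spotted_intervals:
        for idx, t in enumerate(truth_intervals):
            if idx not in matched and overlap_ratio(s, t) >= k:
                matched.add(idx)
                tp += 1
                break
    fp = len(spotted_intervals) - tp
    fn = len(truth_intervals) - tp
    return tp, fp, fn


# Three spotted intervals against two ground-truth micro-expressions.
spotted = [(10, 19), (100, 120), (300, 310)]
truth = [(12, 20), (305, 400)]
print(count_video(spotted, truth))
```

In this toy case only the first spotted interval overlaps a ground-truth interval by at least half of their union, so the other two count as false positives and the second micro-expression is missed.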
In practice, these metrics might not be suitable for some videos, since the following situations can arise on a single video:
- The test video does not contain any micro-expression sequence. Thus the number of micro-expressions is $m = 0$, and the denominator of recall is zero.
- The spotting method does not spot any interval. The denominator of precision is zero, since the number of spotted intervals is $n = 0$.
- If there are two spotting methods, Method1 spots $p$ intervals and Method2 spots $q$ intervals, with $p < q$. Supposing the number of true positives is 0 for both methods, the metrics (recall, precision and F1-score) are zero for both. However, Method1 in fact performs better than Method2, since it produces fewer false positives.
Considering these situations, we propose that, for a single video, the result be recorded in terms of TP, FP and FN. For performance comparison, the other metrics are then calculated for the entire database.
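3. Evaluation for entire database. Supposing that, in the entire database, there are $V$ videos and $M$ micro-expression sequences, and the method spots $N$ intervals in total. The database can be considered as one long video; thus, the metrics for the entire database can be calculated by:

$$\text{Recall}_D = \frac{\sum_{i=1}^{V} a_i}{\sum_{i=1}^{V} m_i} = \frac{\sum_{i=1}^{V} a_i}{M}$$

$$\text{Precision}_D = \frac{\sum_{i=1}^{V} a_i}{\sum_{i=1}^{V} n_i} = \frac{\sum_{i=1}^{V} a_i}{N}$$

$$\text{F1-score}_D = \frac{2 \times \text{Recall}_D \times \text{Precision}_D}{\text{Recall}_D + \text{Precision}_D}$$

where, for video $i$, $a_i$ is the number of TPs, $m_i$ the number of micro-expressions and $n_i$ the number of spotted intervals.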
The final results of the different methods are evaluated by F1-score, since it takes both recall and precision into account.
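The database-level evaluation can be sketched as below: per-video (TP, FP, FN) triples are summed first, and the metrics are computed once on the totals. The function name is hypothetical, and returning 0.0 when a denominator is zero is an assumption made to keep the sketch total.

```python
def database_metrics(per_video_counts):
    """Pool (TP, FP, FN) triples over all videos, then compute the metrics.

    Returns (precision, recall, f1) for the whole database.
    """
    tp = sum(c[0] for c in per_video_counts)
    fp = sum(c[1] for c in per_video_counts)
    fn = sum(c[2] for c in per_video_counts)

    # Assumption: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Pooled LTP-ML counts reported in Table II for the first SAMM column.
precision, recall, f1 = database_metrics([(34, 1958, 125)])
print(round(precision, 4), round(recall, 4), round(f1, 4))
```

With the pooled counts TP = 34, FP = 1958 and FN = 125 from Table II, the rounded values are 0.0171, 0.2138 and 0.0316, which agree with the reported precision, recall and F1-score.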
III Results and Discussion
As introduced in Section II, SAMM and CAS(ME)2 have different frame rates and resolutions. Hence, the length of the sliding window, the overlap size, the interval length and the ROI size are different for the two databases. Table III lists the experimental parameters.
For the CAS(ME)2 database, there are 97 videos, but only 32 of them contain micro-expressions. Thus, results are given under two conditions: one considers only the 32 videos that contain MEs, and the other includes the entire database (all 97 videos). Since the raw videos in the SAMM database are too big to download (700 GB), only 79 videos (full frame: 270 GB; cropped face: 11 GB) were provided for the challenge. In this work, we report results on both versions of the SAMM database: the cropped-face videos provided by the authors using the method in [26], and the full-frame videos. The spotting process is performed only on the downloaded databases.
III-A Experimental Results of LTP-ML Method
After applying the LTP-ML method to these two databases, the spotting results for the whole databases are listed in Table II. The F1-scores for the cropped SAMM videos and for the 32 CAS(ME)2 videos containing MEs are 0.0316 and 0.0179, respectively. LTP-ML performs better on the cropped SAMM videos than on the full-frame ones, since the face-cropping process has already aligned the face region in the video and reduced the influence of irrelevant movements. Concerning the spotting results on CAS(ME)2, there are more FPs when all videos are included, because the videos in this database that have no ME may contain macro-expressions.
III-B Experimental Results of LBP-χ²-distance (LBP-χ²) Method
The result is compared with that of the LBP-χ²-distance (LBP-χ²) method. The spotting results are listed in Table II. For the 32 CAS(ME)2 videos containing MEs, the best result of the LBP-χ² method is obtained when the threshold for peak selection is set to 0.15, with an F1-score of 0.0111. Meanwhile, the highest F1-score on SAMM is 0.0055, obtained when the threshold is set to 0.05.
Compared with the LTP-ML method, the LBP-χ² method is less accurate. The LTP-ML method is capable of spotting subtle movements based on patterns that represent the temporal variation of MEs. Yet the F1-score is low because of the large number of FPs. Both databases contain noise and irrelevant facial movements. For CAS(ME)2 in particular, it is not easy to separate macro-expressions from micro-expressions in 30 fps videos. The ability to distinguish MEs from other movements still needs to be enhanced.
IV CONCLUSIONS
This paper addresses the challenge of spotting MEs in long video sequences using the two most recent databases, i.e. SAMM and CAS(ME)2. We proposed LTP-ML for spotting MEs and provided a set of performance metrics as a guideline for evaluating ME spotting results. Baseline results on these two databases are provided. We demonstrate that our proposed method performs better than the LBP approach in spotting MEs. Whilst the method was able to produce a reasonable number of TPs, a huge challenge still lies ahead due to the large number of FPs. Further research will focus on enhancing the ability to distinguish MEs from other facial movements in order to reduce FPs, including the implementation of deep learning approaches when sufficient data are available.
V ACKNOWLEDGMENTS
The authors gratefully acknowledge the contribution of the Organisers and Program Committee Members.
REFERENCES
- [1] P. Ekman and W. V. Friesen, “Nonverbal leakage and clues to deception,” Psychiatry, vol. 32, no. 1, pp. 88–106, 1969.
- [2] P. Ekman, “Lie catching and microexpressions,” The philosophy of deception, pp. 118–133, 2009.
- [3] J. Endres and A. Laidlaw, “Micro-expression recognition training in medical students: a pilot study,” BMC medical education, vol. 9, no. 1, p. 47, 2009.
- [4] M.-H. Chiu, H. L. Liaw, Y.-R. Yu, and C.-C. Chou, “Facial micro-expression states as an indicator for conceptual change in students’ understanding of air pressure and boiling points,” British Journal of Educational Technology.
- [5] P. A. Stewart, B. M. Waller, and J. N. Schubert, “Presidential speechmaking style: Emotional response to micro-expressions of facial affect,” Motivation and Emotion, vol. 33, no. 2, p. 125, 2009.
- [6] M. H. Yap, J. See, X. Hong, and S.-J. Wang, “Facial micro-expressions grand challenge 2018 summary,” in Automatic Face & Gesture Recognition (FG 2018), 2018 13th IEEE International Conference on. IEEE, 2018, pp. 675–678.
- [7] A. Moilanen, G. Zhao, and M. Pietikäinen, “Spotting rapid facial movements from videos using appearance-based feature difference analysis,” in Pattern Recognition (ICPR), 2014 22nd International Conference on. IEEE, 2014, pp. 1722–1727.
- [8] X. Li, X. Hong, A. Moilanen, X. Huang, T. Pfister, G. Zhao, and M. Pietikäinen, “Towards reading hidden emotions: A comparative study of spontaneous micro-expression spotting and recognition methods,” IEEE Transactions on Affective Computing, 2017.
