MESTI-MEGANet: Micro-expression Spatio-Temporal Image and Micro-expression Gradient Attention Networks for Micro-expression Recognition
Luu Tu Nguyen, Vu Tram Anh Khuong, Thanh Ha Le, Thi Duyen Ngo

TL;DR
This paper introduces MESTI, a novel image modality, and MEGANet, an attention-based network, achieving state-of-the-art micro-expression recognition performance by effectively capturing subtle facial movements.
Contribution
The study presents MESTI and MEGANet, combining a new input modality with an attention network to significantly improve micro-expression recognition accuracy.
Findings
MESTI outperforms existing input modalities across CNN architectures.
Replacing inputs with MESTI improves existing MER networks.
MEGANet achieves state-of-the-art results on CASMEII and SAMM datasets.
Abstract
Micro-expression recognition (MER) is a challenging task due to the subtle and fleeting nature of micro-expressions. Traditional input modalities, such as Apex Frame, Optical Flow, and Dynamic Image, often fail to adequately capture these brief facial movements, resulting in suboptimal performance. In this study, we introduce the Micro-expression Spatio-Temporal Image (MESTI), a novel dynamic input modality that transforms a video sequence into a single image while preserving the essential characteristics of micro-movements. Additionally, we present the Micro-expression Gradient Attention Network (MEGANet), which incorporates a novel Gradient Attention block to enhance the extraction of fine-grained motion features from micro-expressions. By combining MESTI and MEGANet, we aim to establish a more effective approach to MER. Extensive experiments were conducted to evaluate the…

| Input modality for MER | VGG19 / CASME II | VGG19 / SAMM | ResNet50 / CASME II | ResNet50 / SAMM | EfficientNet-B0 / CASME II | EfficientNet-B0 / SAMM |
|---|---|---|---|---|---|---|
| Static | | | | | | |
| Apex frame | 50.00% | 43.75% | 46.15% | 50.00% | 34.62% | 43.75% |
| Dynamic | | | | | | |
| Optical flow (Onset–Apex) | 50.00% | 50.00% | 46.15% | 37.50% | 26.92% | 43.75% |
| Optical flow (Apex–Offset) | 53.85% | 50.00% | 34.62% | 43.75% | 46.15% | 43.75% |
| Dynamic image | 57.69% | 50.00% | 53.85% | 37.50% | 53.85% | 50.00% |
| Affective motion image | 50.00% | 43.75% | 53.85% | 50.00% | 46.15% | 43.75% |
| Active image | 48.00% | 57.14% | 44.00% | 50.00% | 52.00% | 48.00% |
| MESTI (ours) | 73.08% | 62.50% | 65.38% | 56.25% | 61.54% | 50.00% |

| Input | Network | CASME II ACC | SAMM ACC |
|---|---|---|---|
| Apex Image* | Micro-attention [wang2020micro] | 65.9% | 48.5% |
| MESTI | Micro-attention [wang2020micro] | 71.02% | 63.24% |
| Dynamic Image* | VGG19 [8844867] | 51.02% | 43.23% |
| MESTI | VGG19 [8844867] | 69.39% | 60.29% |

| Method | Year | CASME II UF1 | CASME II UAR | CASME II ACC | SMIC-HS UF1 | SMIC-HS UAR | SMIC-HS ACC | SAMM UF1 | SAMM UAR | SAMM ACC |
|---|---|---|---|---|---|---|---|---|---|---|
| FeatRef [ZHOU2022108275] | 2022 | 0.892 | 0.887 | – | 0.701 | 0.708 | – | 0.737 | 0.716 | – |
| Dual-ATME [e25030460] | 2023 | 0.765 | 0.751 | 0.817 | 0.646 | 0.658 | 0.646 | 0.562 | 0.538 | 0.714 |
| FRL-DGT [zhai2023featurerepresentationlearningadaptive] | 2023 | 0.919 | 0.903 | – | 0.743 | 0.749 | – | 0.772 | 0.758 | – |
| SelfME [10204934] | 2023 | 0.908 | 0.929 | – | 0.697 | 0.701 | – | – | – | – |
| Micron-BERT [Nguyen2023MicronBERTBF] | 2023 | 0.903 | 0.891 | – | 0.855 | 0.838 | – | – | – | – |
| MERASTC [9363624] | 2023 | 0.933 | 0.950 | – | 0.790 | 0.862 | – | – | – | – |
| GLEFFN [10.1145/3607829.3616446] | 2023 | 0.883 | 0.911 | – | 0.771 | 0.786 | – | – | – | – |
| SODA4MER [soda4mer] | 2025 | 0.887 | 0.881 | – | 0.886 | 0.888 | – | – | – | – |
| OFVIG-Net [ofvig] | 2025 | 0.713 | 0.720 | – | 0.644 | 0.640 | – | 0.607 | 0.579 | – |
| MESTI-MEGANet | 2025 | 0.913 | 0.929 | 0.932 | 0.917 | 0.924 | 92.68 | 0.890 | 0.914 | 0.918 |

| Method | Year | CASME II UF1 | CASME II UAR | CASME II ACC | SAMM UF1 | SAMM UAR | SAMM ACC |
|---|---|---|---|---|---|---|---|
| GEME [NIE202113] | 2021 | 0.735 | - | 75.20 | 0.454 | - | 55.88 |
| MER-Supcon [mersupcon] | 2022 | 0.729 | - | 73.58 | 0.625 | - | 67.65 |
| CMNet [wei2023cmnet] | 2023 | 0.740 | - | 78.05 | 0.772 | - | 78.68 |
| C3DBed [PAN2023106258] | 2023 | 0.752 | - | 77.64 | 0.722 | - | 75.73 |
| KPCANet [kpcanet] | 2023 | 0.659 | - | 70.46 | 0.522 | - | 63.83 |
| JGULF [WANG2024105091] | 2024 | 0.807 | - | 82.04 | 0.720 | - | 80.71 |
| AU GCN [10446702] | 2024 | 0.776 | - | 81.85 | 0.757 | - | 79.82 |
| SODA4MER [soda4mer] | 2025 | 0.814 | - | 84.18 | 0.789 | - | 80.30 |
| LRT3O [10981613] | 2025 | 0.791 | - | 81.78 | 0.757 | - | 80.15 |
| Micro_NesT [HE2025128372] | 2025 | 0.772 | - | 77.93 | 0.748 | - | 76.69 |
| MELLM [mellm] | 2025 | 0.485 | 0.534 | 64.34 | - | - | - |
| MESTI-MEGANet | 2025 | 0.779 | 0.786 | 82.04 | 0.791 | 0.803 | 80.88 |

| Method | Year | CASME II → SMIC Acc | CASME II → SMIC WF1 | SAMM → SMIC Acc | SAMM → SMIC WF1 |
|---|---|---|---|---|---|
| STCNN [8852419] | 2019 | 31.40 | 19.00 | 32.50 | 19.00 |
| CapsuleNet [10.1109/FG.2019.8756544] | 2019 | 32.20 | 15.20 | 32.40 | 17.90 |
| MER-GCN [9175230] | 2020 | 36.70 | 27.20 | 36.10 | 17.80 |
| AU-GACN [auasis] | 2020 | 34.40 | 31.90 | 45.10 | 30.90 |
| MOL [shao2025mol] | 2025 | 47.13 | 43.91 | 44.58 | 32.32 |
| Ours | 2025 | 50.00 | 46.81 | 46.95 | 40.89 |

| Gradient Attention Block | Residual Block | Self-attention | UF1 | UAR | ACC |
|---|---|---|---|---|---|
| – | ✓ | – | 0.788 | 0.821 | 80.49 |
| – | – | ✓ | 0.746 | 0.775 | 75.61 |
| – | ✓ | ✓ | 0.8304 | 0.8609 | 83.54 |
| ✓ | – | – | 0.8084 | 0.8551 | 81.10 |
| ✓ | ✓ | ✓ | 0.917 | 0.924 | 92.68 |

| Number of Residual Attention Blocks | UF1 | UAR | ACC |
|---|---|---|---|
| 2 | 0.802 | 0.842 | 82.32 |
| 3 | 0.917 | 0.924 | 92.68 |
| 4 | 0.844 | 0.878 | 85.98 |

| Attention type | CASME II → SMIC Acc | CASME II → SMIC WF1 | SAMM → SMIC Acc | SAMM → SMIC WF1 |
|---|---|---|---|---|
| None | 45.12 | 44.67 | 39.63 | 39.97 |
| SE [hu2018squeeze] | 44.51 | 42.10 | 43.90 | 40.74 |
| CBAM [woo2018cbam] | 45.73 | 43.15 | 41.46 | 39.28 |
| Gradient Attention | 50.00 | 46.81 | 46.95 | 40.89 |

| Input representation | CASME II → SMIC Acc | CASME II → SMIC WF1 | SAMM → SMIC Acc | SAMM → SMIC WF1 |
|---|---|---|---|---|
| Apex frame | 35.98 | 30.04 | 39.63 | 37.36 |
| Dynamic image | 40.85 | 40.08 | 40.24 | 36.09 |
| MESTI-middle based | 48.17 | 42.50 | 46.34 | 40.83 |
| MESTI-apex based | 50.00 | 46.81 | 46.95 | 40.89 |
Apex-Centered Spatio-Temporal Rank Pooling and Gradient Attention for Micro-Expression Recognition
Luu Tu Nguyen, Vu Tram Anh Khuong, Thanh Ha Le, Thi Duyen Ngo (corresponding author, [email protected])
Faculty of Information Technology, VNU University of Engineering and Technology, 144 Xuan Thuy, Cau Giay, Hanoi 100000, Viet Nam
Author contributions (each author): Conceptualization of this study, Methodology, Software
Abstract
Micro-expression recognition (MER) is a challenging task due to the subtle and fleeting nature of micro-expressions. Traditional input modalities, such as Apex Frame, Optical Flow, and Dynamic Image, often fail to adequately capture these brief facial movements, resulting in suboptimal performance. In this study, we introduce the Micro-expression Spatio-Temporal Image (MESTI), a micro-expression-specific reformulation of dynamic rank pooling that transforms a video sequence into a single image while emphasizing the onset-apex-offset temporal pattern of micro-expressions. Additionally, we present the Micro-expression Gradient Attention Network (MEGANet), which incorporates a proposed Gradient Attention block to enhance the extraction of fine-grained motion features from micro-expressions. By combining MESTI and MEGANet, we aim to establish a more effective approach to MER. Extensive experiments evaluate the effectiveness of MESTI by comparing it with existing input modalities across standard CNN architectures. Moreover, we demonstrate that replacing the input of previously published MER networks with MESTI leads to consistent performance improvements. MEGANet is also evaluated: it achieves state-of-the-art results on the SMIC-HS and SAMM datasets and competitive performance on CASME II, and it achieves leading performance in the reported cross-dataset evaluation settings. The combination of MESTI and MEGANet consistently outperforms the compared methods. These findings underscore the potential of MESTI as a superior input modality and MEGANet as an advanced recognition network, paving the way for more effective MER systems in a variety of applications.
Keywords: Micro-expression recognition, Rank-pooling, Gradient attention, Micro-expression input representation, Micro-expression recognition network
1 INTRODUCTION
Facial expression, a vital channel of non-verbal communication, encompasses two primary types: macro-expressions and micro-expressions. Macro-expressions are typically deliberate, easily observable, and last for an extended period, conveying a person’s emotions openly [Ekman01051992]. In contrast, micro-expressions (MEs) are brief, involuntary facial movements that last less than 0.5 seconds [Matsumoto2011, Yan2013], making them very difficult to control or fabricate. Unlike macro-expressions, MEs reveal a person’s genuine emotions, often surfacing when one attempts to conceal their true feelings [Ekman2003-eb]. These fleeting expressions are especially revealing in high-risk situations [Goh2020, 7851001], where concealing emotions is common. Owing to these characteristics, MEs have garnered significant attention as a channel for uncovering individuals’ genuine thoughts and emotions, and their involuntary nature makes them applicable in a range of critical fields. For instance, MEs can improve the accuracy of deception detection systems by providing cues that more prolonged expressions may not capture [Yildirim2023]. In criminal investigations, law enforcement officers can assess a suspect’s truthfulness by analyzing MEs that may contradict verbal statements [Frank2015]. Beyond security applications, MEs are increasingly relevant in healthcare, particularly in clinical settings, where they can provide essential clues about a patient’s emotional state and aid medical professionals in assessing recovery progress [Endres2009].
Despite its potential, ME recognition (MER) presents significant challenges due to the brevity and subtle intensity of MEs. Studies have shown that even experts achieve only 47% accuracy in recognizing MEs, highlighting the inherent complexity of this task [frank2009see]. However, leveraging advancements in computational capabilities, as well as modern machine learning and deep learning algorithms, computer-based systems for ME analysis have demonstrated significant superiority over human performance, with accuracy rates often exceeding 50%. These advancements offer a promising pathway for achieving more accurate and reliable recognition of MEs across a wide range of applications [9915437].
Early MER approaches primarily relied on handcrafted descriptors, such as texture-based [6130343] and optical-flow-based methods [7286757, happy2017fuzzy, LIONG201882], to encode subtle facial motion. While these methods established important baselines, their representation capacity is limited, and their performance often degrades under complex real-world conditions or in cross-database evaluation scenarios. With the success of deep learning, recent MER research has increasingly focused on learning-based approaches [9915437]. Existing deep MER methods can be broadly understood from two complementary perspectives: input representation and recognition architecture [9915437]. On the one hand, the input representation determines how the weak temporal dynamics of a micro-expression are exposed to the network. On the other hand, the recognition architecture determines whether the model can capture and amplify the subtle motion cues embedded in that representation.
However, current input modalities for MER still suffer from some limitations. Apex-frame-based methods [8451376, 10.1109/FG.2019.8756544] are compact and efficient, but they ignore temporal evolution and therefore miss the onset-to-offset dynamics of micro-expressions. Optical-flow based methods [gan2019off, happy2017fuzzy] provide explicit motion information, but they are often sensitive to noise and unstable when motion is subtle. Dynamic imaging based methods [bilen2016dynamic, 9194324, ACImage], which compress a sequence into a single image, are attractive because they provide compact spatio-temporal encoding, yet most existing formulations are inherited from general action recognition and are not explicitly tailored to the characteristic onset-apex-offset structure of micro-expressions. Consequently, they may fail to sufficiently emphasize the short discriminative phase around the apex.
At the architectural level, many recent MER networks have improved performance by using motion magnification [Kumar2019ClassificationOF], graph reasoning [lo2020mer], or transformer-style context modeling [9747232]. Although these advances are important, two issues remain insufficiently addressed. First, many methods still rely on input representations that are not specifically designed for the temporal morphology of micro-expressions. Second, existing attention mechanisms often emphasize semantic or regional importance at higher feature levels, but do not explicitly target the weak local intensity transitions that are crucial for MER at an early representation stage.
Motivated by these observations, we propose a new MER framework consisting of two complementary components: Micro-expression Spatio-Temporal Image (MESTI) and Micro-expression Gradient Attention Network (MEGANet). MESTI is a compact video-to-image representation designed specifically for MER. Instead of using a generic temporal ranking scheme, it introduces an apex-centered temporal encoding strategy that reflects the onset-apex-offset temporal modeling of micro-expressions, thereby concentrating representational emphasis around the most informative motion phase. MEGANet is a recognition network designed to work effectively with such subtle motion representations. It incorporates a Gradient Attention Block to highlight weak local intensity transitions and Residual Attention Blocks to further model spatial dependencies and refine discriminative facial patterns.
The proposed framework is motivated by an important principle for MER: an effective representation should not only compress the sequence, but should do so in a way that respects the unique temporal structure of micro-expressions. Likewise, an effective network should not only model context, but should explicitly focus on subtle motion cues that are easy to overlook.
The main contributions of this paper are summarized as follows:
- We propose MESTI, a MER-oriented spatio-temporal representation that reformulates sequence-to-image encoding using an apex-centered ranking principle aligned with the temporal dynamics of micro-expressions.
- We propose MEGANet, an ME recognition network that integrates a novel Gradient Attention Block with Residual Attention Blocks so that it can focus on motion regions, thereby improving ME recognition performance.
- We design a comprehensive set of experimental scenarios to validate the effectiveness of the proposed components, and report performance that surpasses previous state-of-the-art methods on SMIC-HS and SAMM.
Through extensive experiments, the effectiveness of each proposed component (MESTI, MEGANet) is demonstrated by evaluating their individual contributions and their combination with previously published methods. The results show that each component of our proposed method enhances the performance of the ME recognition process, and when combined, MESTI and MEGANet yield an effective overall MER approach.
2 RELATED WORKS
2.1 Handcrafted Methods for Micro-Expression Recognition
Early MER research was dominated by handcrafted feature design. A major line of work focused on texture-based descriptors, where local texture changes over space and time were encoded using methods such as LBP-TOP [6130343] and its variants [inter, 10.1007/978-3-319-16865-4_34]. These approaches were motivated by the observation that subtle facial movements can be reflected as local spatio-temporal texture variations. Subsequent extensions improved the descriptive ability of such features by incorporating different local operators and quantized local patterns [HUANG2016564]. These methods played an important role in the early development of MER because they offered interpretable motion descriptors and established initial benchmarks on public datasets.
Another major line of handcrafted MER methods relied on optical flow and related motion descriptors [7286757, of01, LIONG201882]. Since micro-expressions are fundamentally defined by subtle facial motion, optical flow provides a natural mechanism for capturing local direction and magnitude changes between frames. Methods based on main directional motion statistics, weighted optical flow, and fuzzy directional histograms have shown that motion-oriented handcrafted features are often more suitable for MER than purely static facial appearance features.
Despite their historical importance, handcrafted methods have intrinsic limitations. Their feature extraction process is manually designed and therefore has limited adaptability to the diversity and complexity of spontaneous micro-expressions. In addition, texture descriptors may fail to capture highly localized motion phases, while optical-flow-based descriptors are often sensitive to noise, head movement, and illumination changes. As a result, their performance remains limited, especially when compared with more recent deep learning approaches.
2.2 Input Representations in Deep MER
2.2.1 Apex frame and static image based representations
A straightforward way to simplify MER is to use the apex frame. Apex-based representations are computationally efficient and reduce the sequence modeling problem to a standard image classification setting. Several MER methods have demonstrated that the apex frame alone can contain useful discriminative information, especially when combined with powerful feature extractors or local-global fusion mechanisms [8451376, li2020joint].
However, the major limitation of apex-based input is that it removes most of the temporal evolution of the expression. Micro-expressions are not purely static events; they unfold through a short onset–apex–offset process. When only a single frame is used, the model loses the dynamic context needed to distinguish subtle motion patterns, especially when different classes exhibit similar apex appearances. Thus, apex-frame-based methods are compact, but they sacrifice temporal fidelity.
2.2.2 Optical flow based deep representations
To preserve motion information more explicitly, many deep MER methods adopt optical flow or onset-apex motion maps as input. These approaches have proven effective because they directly encode motion magnitude and direction, often providing stronger cues than raw appearance images. Representative methods such as OFF-ApexNet [gan2019off] and subsequent optical flow based networks showed that motion-driven inputs can significantly improve MER performance [Zeng2023, happy2017fuzzy].
Nevertheless, optical flow is not an optimal solution for MER. Because the movements involved in micro-expressions are extremely weak, the estimated flow field can be unstable and noisy. Small facial movements may be easily contaminated by illumination variation, compression artifacts, or slight head motion. In addition, multi-stream processing of horizontal and vertical flow components often increases model complexity. Therefore, although optical flow introduces temporal information, it may also introduce noise and computational overhead.
2.2.3 Dynamic imaging based representations
An appealing alternative is to summarize the entire video sequence into a single spatio-temporal image. Dynamic Image is a representative technique in this category and was originally developed for action recognition [bilen2016dynamic]. It encodes temporal evolution through rank pooling and produces a compact image-like representation that can be processed by conventional CNNs. This idea is attractive for MER because it balances compactness and temporal encoding.
Several MER studies have adopted or adapted this idea, leading to variants such as Affective Image [9194324] and Active Image [ACImage]. These methods demonstrate that sequence-to-image representations can be effective for MER, especially when paired with dedicated CNN architectures. However, most of these methods are inherited from generic dynamic summarization frameworks and are not explicitly designed around the characteristic temporal morphology of micro-expressions. In particular, micro-expressions are not simply temporally progressive events; their discriminative information is concentrated around a short apex-centered phase. A generic temporal compression strategy may therefore dilute or misrepresent the brief motion pattern that is most informative for MER.
This limitation motivates the need for a sequence-to-image representation that is specifically aligned with the onset–apex–offset structure of micro-expressions rather than with generic temporal ordering alone.
2.3 Deep MER Architectures
2.3.1 CNN-based and attention-based networks
With the rise of deep learning, CNN-based architectures became the dominant paradigm in MER. Early CNN-based methods mainly focused on extracting discriminative appearance features from apex frames or dynamic representations [liu2020offset]. Later studies incorporated attention mechanisms to guide the model toward more informative facial regions. For example, micro-attention [wang2020micro] and magnification-adaptive networks [9747232] attempted to improve MER by focusing on subtle motion-relevant areas. Other methods, such as LEARNet [8844867] and CMNet [wei2023cmnet], explored more specialized architectures for dynamic or contrastive MER learning.
These methods have significantly advanced the field. However, many of them place attention at intermediate or high-level feature stages and are primarily designed to improve region selection or context modeling. They do not necessarily provide an explicit mechanism for enhancing the weak local intensity transitions that characterize micro-expressions at the representation level. As a result, very subtle motion cues may still be underrepresented during early feature extraction.
2.3.2 Transformer and hybrid methods
More recently, MER has benefited from transformer-based and hybrid CNN-transformer models. Methods such as Micron-BERT [Nguyen2023MicronBERTBF] and other recent vision-transformer-based frameworks show that long-range spatial or temporal dependency modeling can be beneficial for MER [PAN2023106258]. Multi-scale attention frameworks and hybrid feature-fusion models have also demonstrated strong performance by combining local motion descriptors with global contextual modeling [HE2025128372].
These methods confirm that richer dependency modeling is valuable. However, they still face the core challenge of MER: the signal itself is weak. Even a powerful global modeling framework may not fully solve the problem if the input representation does not expose the subtle motion clearly enough or if the network lacks an explicit mechanism to emphasize weak local transitions at an early stage.
2.4 Research Gaps
The above review indicates that MER research has made substantial progress, yet two important gaps remain. First, input representation is still a bottleneck. Apex-based inputs are compact but discard temporal evolution; optical-flow-based inputs encode motion but are often noise-sensitive; and existing dynamic-imaging approaches provide compact sequence summarization but are mostly inherited from generic temporal pooling frameworks without explicitly modeling the onset–apex–offset dynamics of micro-expressions.
Second, network design is still not fully aligned with the weak-signal nature of MER. Existing architectures improve contextual reasoning through attention, graph modeling, magnification, or transformers, but many do not explicitly emphasize subtle local intensity transitions in a representation-aware manner.
These two gaps motivate the proposed framework in this paper. MESTI is designed to provide a compact spatio-temporal representation that is explicitly centered on micro-expression dynamics, while MEGANet is designed to amplify weak gradient-based motion cues and refine them through residual attention modeling. Together, they aim to address both the representation-level and architecture-level limitations of existing MER methods.
3 PROPOSED METHOD
3.1 Micro-expression Spatio-Temporal Image
The idea for an effective ME input representation stems from observing the motion characteristics of MEs. The intensity of motion gradually increases from the onset frame (the starting frame) to the apex (the frame with the highest ME intensity), then decreases towards the offset frame (the final frame of the ME). The proposed method simulates this motion pattern when constructing a distinctive representation for MEs, namely MESTI. Our objective is to create a spatio-temporal image that effectively represents an ME video. To achieve this, a temporal encoding approach is introduced that transforms the entire video sequence into a single representative image. Additionally, our method aggregates spatial information from the video into a compact static representation.
Inspired by the approximate rank pooling method, which has been used in modeling video evolution [Fernando_2015_CVPR], a similar strategy is proposed to encode the temporal evolution of MEs into a single image. This approach captures the dynamic variations in facial expressions over time while preserving the spatial structure necessary for effective ME recognition.
Different from conventional Dynamic Image construction, which is based on a temporally monotonic ranking principle, our objective is to reformulate the ranking relation according to the temporal morphology of micro-expressions. A typical micro-expression is not a uniformly progressive event; instead, its discriminative motion is concentrated around a short apex-centered interval, with intensity increasing from onset to apex and decreasing from apex to offset. Therefore, rather than assigning importance purely according to temporal order, MESTI ranks frames according to their proximity to the apex. This change is central: it shifts the inductive bias of rank pooling from generic temporal progression to apex-centered motion concentration, which is more suitable for MER.
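To make the contrast concrete, the short sketch below lists the frame weights used by the simplified approximate rank pooling of the Dynamic Image method [bilen2016dynamic], where frame $t$ of a $T$-frame clip receives the weight $2t - T - 1$. The function name `dynamic_image_coefficients` is ours and purely illustrative; the point is only that these weights grow monotonically with time and therefore do not single out the apex.

```python
from typing import List


def dynamic_image_coefficients(num_frames: int) -> List[int]:
    """Frame weights of simplified approximate rank pooling (Dynamic Image).

    Frame t (1-based) receives 2*t - T - 1, so later frames always
    get larger weights regardless of where the apex lies.
    """
    return [2 * t - num_frames - 1 for t in range(1, num_frames + 1)]


if __name__ == "__main__":
    weights = dynamic_image_coefficients(5)
    print(weights)  # strictly increasing with the frame index
```

An apex-centered ranking replaces this monotone profile with one that peaks at the apex frame, which is the change argued for in the paragraph above.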
3.1.1 Spatial Encoding
A video is represented as a sequence of consecutive frames, denoted as $\{I_1, I_2, \ldots, I_T\}$, where $T$ is the total number of frames and $I_t$ represents the frame at time step $t$. Let $\psi_t$ denote the feature vector extracted from each individual frame $I_t$. In this study, $\psi_t$ is a vector that directly encodes the RGB components of each pixel in the frame $I_t$.
Let $d$ be a parameter vector that assigns a score to each frame at time $t$ through the ranking function $S(t \mid d)$ in Equation 1.
[TABLE]
The parameter $d$ is learned from the entire frame sequence, so that the scores assigned to the frames reflect their relative ranking. Learning $d$ is formulated as a convex optimization problem using RankSVM [smola2004tutorial]; $d^{*}$ denotes the resulting optimal parameter vector, as described in Equation 2.
[TABLE]
This process integrates spatial information from individual frames into a ME image that preserves structural and appearance details. By leveraging the extracted RGB feature vectors, the method ensures that spatial characteristics of each frame are considered in the ranking process, allowing the network to learn an optimal frame-ordering that reflects their relative importance in the sequence.
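For orientation, the score function and ranking objective in the rank-pooling literature that this subsection builds on [Fernando_2015_CVPR, bilen2016dynamic] are commonly written as below. This is a sketch in that literature's notation, not a transcription of Equations 1 and 2: the symbols $\psi_t$, $V_t$, $d$, $S$ and $\lambda$, as well as the time averaging in $V_t$, are assumptions taken from those references, and the ordering constraint $q > t$ is the generic temporal one rather than the apex-centered relation introduced in Section 3.1.2.

```latex
% Assumed notation from the rank-pooling / dynamic-image literature,
% shown for reference only; the paper's own Equations 1-2 may differ in detail.
V_t = \frac{1}{t} \sum_{\tau=1}^{t} \psi_\tau ,
\qquad
S(t \mid d) = \langle d, V_t \rangle

d^{*} = \operatorname*{argmin}_{d} E(d),
\qquad
E(d) = \frac{\lambda}{2} \lVert d \rVert^{2}
     + \frac{2}{T(T-1)} \sum_{q > t} \max\bigl\{ 0,\; 1 - S(q \mid d) + S(t \mid d) \bigr\}
```

The hinge term penalizes every frame pair whose scores violate the required ordering by less than a unit margin, so changing which pairs are required to be ordered is enough to change what the pooled image emphasizes.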
3.1.2 Temporal Encoding
Temporal encoding is performed based on the characteristic motion patterns of MEs, which serve as a basis for assigning scores to each frame during the rank pooling process of spatial encoding. Figure 1 illustrates the intensity of motion in ME. The motion characteristics of MEs can be easily observed: the intensity gradually increases from the first frame (onset), peaks at the apex frame, and then gradually decreases toward the final frame (offset) of the ME. Therefore, in this study, we aim to model the motion characteristics of MEs within the temporal encoding process to construct a ME image from the video.
Temporal encoding is implemented by generating a ranking score that simulates the motion intensity of the ME in a straightforward manner during the rank pooling process. Let $t_a$ denote the index of the apex frame, where the motion intensity of the ME reaches its maximum. Given any two frames $I_i$ and $I_j$, the frame closer to the apex frame is assigned a higher ranking score in our ranking function.
Thus, for any pair of frames $(I_i, I_j)$ in which $I_i$ is closer to the apex than $I_j$, the ranking scores must satisfy $S(i \mid d) > S(j \mid d)$, where $S(\cdot \mid d)$ is the ranking function of Equation 1. Accordingly, Equation 2 is further expanded as Equation 3:
[TABLE]
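The following sketch illustrates how an apex-centered ranking objective of this kind could be evaluated. It rests on several assumptions that are ours rather than the paper's: frames are compared only within the same side of the apex (the apex belongs to both sides), scores are plain dot products with the raw frame features (no time averaging), and the hinge term is averaged over the number of pairs. The names `apex_centered_pairs` and `ranking_objective` are hypothetical.

```python
from typing import List, Sequence, Tuple


def apex_centered_pairs(num_frames: int, apex: int) -> List[Tuple[int, int]]:
    """Return (higher, lower) pairs of 0-based frame indices.

    Assumption: two frames are compared only if they lie on the same side
    of the apex; the one closer to the apex must score higher.
    """
    pairs = []
    for i in range(num_frames):
        for j in range(num_frames):
            same_side = (i <= apex and j <= apex) or (i >= apex and j >= apex)
            if same_side and abs(i - apex) < abs(j - apex):
                pairs.append((i, j))
    return pairs


def ranking_objective(
    d: Sequence[float],
    features: Sequence[Sequence[float]],
    pairs: Sequence[Tuple[int, int]],
    lam: float = 1.0,
) -> float:
    """Quadratic regularizer plus mean hinge loss over the ranked pairs."""

    def score(t: int) -> float:
        # Assumed score: dot product between d and the raw frame feature.
        return sum(w * x for w, x in zip(d, features[t]))

    regularizer = 0.5 * lam * sum(w * w for w in d)
    hinge = sum(max(0.0, 1.0 - score(hi) + score(lo)) for hi, lo in pairs)
    return regularizer + hinge / len(pairs)


if __name__ == "__main__":
    # Five one-dimensional "frames" whose intensity peaks at index 2.
    feats = [[0.0], [1.0], [2.0], [1.0], [0.0]]
    pairs = apex_centered_pairs(len(feats), apex=2)
    print(len(pairs), ranking_objective([1.0], feats, pairs))
```

With `d = [1.0]` every pair is ranked correctly with a unit margin, so only the regularizer contributes; with `d = [0.0]` every pair incurs a hinge loss of one.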
The first term in Equation 3 is the standard quadratic regularizer used in SVMs. The second term is a hinge loss that soft-counts how many pairs of frames are incorrectly ranked by the scoring function. To solve the optimization problem defined through Equation 1 and Equation 2, the approximate rank pooling (ARP) method [bilen2016dynamic] is used. Starting with $d = \vec{0}$, the first approximated solution obtained by gradient descent is:
[TABLE]
where:
[TABLE]
[TABLE]
can be expanded as follows:
[TABLE]
[TABLE]
[TABLE]
where the $\alpha_t$ are scalar coefficients. Expanding the sum gives two cases.
When the frame lies between the onset frame and the apex frame ($1 \le t \le t_a$):
[TABLE]
[TABLE]
When the frame lies between the apex frame and the offset frame ($t_a \le t \le T$):
[TABLE]
[TABLE]
Finally, the coefficient $\alpha_t$ can be efficiently computed in the two scenarios by aggregating the coefficients of $\psi_t$ along with their respective positive and negative signs:
[TABLE]
[TABLE]
Hence $d^{*}$ can be expressed as the rank pooling operator after applying ARP:
[TABLE]
Finally, the MESTI construction is approximated by multiplying the feature vector $\psi_t$, which represents the RGB components of each frame at time $t$, by the coefficient $\alpha_t$ provided in Equation 4 and summing over all frames.
The MESTI constructed from the frame sequence in Figure 2a is shown in Figure 2b, together with a plot of the motion intensity over the input sequence. Two observations follow. First, our method generates a ranking function that better simulates the nature of ME motion. Second, the final MESTI image shows the action units of the ME more clearly than the traditional dynamic image shown in Figure 2c.
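As a minimal sketch of this construction, the code below derives per-frame coefficients by counting ranked pairs and then forms the weighted sum of frames. It is an illustration under stated assumptions, not the paper's Equation 4: frames are compared only within the same side of the apex (the apex counts for both sides), no time averaging is applied, and frames are given as flattened pixel lists. Under these assumptions, the first gradient step from $d = 0$ is proportional to the sum of (higher-ranked minus lower-ranked) frame features, so each frame's coefficient is the number of pairs in which it must rank higher minus the number in which it must rank lower. The names `mesti_coefficients` and `build_image` are hypothetical.

```python
from typing import List, Sequence


def mesti_coefficients(num_frames: int, apex: int) -> List[int]:
    """Apex-centered frame weights from first-order approximate rank pooling.

    Assumption: pairs are formed only within the onset-to-apex segment and
    within the apex-to-offset segment; the frame closer to the apex ranks
    higher. Indices are 0-based.
    """
    alpha = [0] * num_frames
    for hi in range(num_frames):
        for lo in range(num_frames):
            same_side = (hi <= apex and lo <= apex) or (hi >= apex and lo >= apex)
            if same_side and abs(hi - apex) < abs(lo - apex):
                alpha[hi] += 1  # frame hi must rank above frame lo
                alpha[lo] -= 1
    return alpha


def build_image(frames: Sequence[Sequence[float]], alpha: Sequence[float]) -> List[float]:
    """Weighted sum of flattened frames: one output value per pixel."""
    num_pixels = len(frames[0])
    return [
        sum(a * frame[p] for a, frame in zip(alpha, frames))
        for p in range(num_pixels)
    ]


if __name__ == "__main__":
    alpha = mesti_coefficients(num_frames=7, apex=3)
    print(alpha)  # peaks at the apex, falls off toward onset and offset
    frames = [[1.0, 1.0], [2.0, 4.0], [1.0, 1.0]]
    print(build_image(frames, mesti_coefficients(3, 1)))
```

Because the coefficients sum to zero, pixels that do not change across the sequence cancel out, which is why such images highlight motion rather than static appearance. For display or CNN input, the result would typically be rescaled to the image value range; that step is omitted here.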
3.1.3 Theoretical properties of the proposed apex-centered ranking
Proposition 1. Under the apex-centered pairwise ranking relation
[TABLE]
the first-order approximate rank pooling solution can be written as
[TABLE]
where
[TABLE]
Therefore, the coefficient sequence is analytically induced by the proposed ranking relation rather than manually selected.
Proof.
Starting from the objective in Eq. (3), the first-order approximation of rank pooling gives
[TABLE]
By grouping terms with respect to each , the coefficients for frames in the onset-to-apex interval and the apex-to-offset interval can be collected separately. The resulting piecewise linear form is exactly Eq. (4). Hence, the coefficients are a direct consequence of the apex-centered ranking formulation.
Proposition 2. The coefficient sequence induced by MESTI is apex-centered and unimodal: it increases monotonically on and decreases monotonically on .
Proof.
For :
[TABLE]
which shows monotonic increase toward the apex.
For :
[TABLE]
which shows monotonic decrease after the apex. Therefore, the coefficient profile is unimodal and centered around the apex.
Proposition 2 clarifies why the proposed formulation differs from conventional Dynamic Image ranking. Instead of following global temporal order, MESTI concentrates representational emphasis around the apex, where discriminative ME motion is expected to be strongest. This property makes the representation better aligned with the onset–apex–offset nature of micro-expressions.
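The property stated in Proposition 2 can be checked numerically for any coefficient sequence. The sketch below is illustrative only: `is_apex_unimodal` is a hypothetical helper, the triangular profile is an assumed stand-in for the Equation 4 coefficients, and the linear weights use the commonly cited simplified approximate rank pooling form of Dynamic Image, with coefficient 2t - T - 1 for frame t of T.

```python
import numpy as np

def is_apex_unimodal(coeffs, apex):
    """True if coeffs strictly increase up to index `apex` and strictly
    decrease afterwards (the shape described in Proposition 2)."""
    c = np.asarray(coeffs, dtype=np.float64)
    rising = np.all(np.diff(c[: apex + 1]) > 0)
    falling = np.all(np.diff(c[apex:]) < 0)
    return bool(rising and falling)

T, apex = 9, 5
# Simplified Dynamic Image weights: grow with time over the whole clip.
linear = 2 * np.arange(1, T + 1) - T - 1
# Assumed apex-centred profile (stand-in for Eq. (4)).
triangular = -np.abs(np.arange(T) - apex)

print(is_apex_unimodal(triangular, apex))  # apex-centred profile
print(is_apex_unimodal(linear, apex))      # global temporal order
```

The linear weights only pass the check when the apex coincides with the last frame, which illustrates the difference between ranking by global temporal order and ranking around the apex.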
3.2 Micro-expression Gradient Attention Networks
The challenge in ME recognition lies in capturing the subtle, transient spatiotemporal patterns that characterize MEs, which often involve faint intensity changes that conventional CNNs struggle to detect. Because these expressions are fleeting, traditional methods have difficulty focusing on the most critical regions of motion. To address this, we propose MEGANet, a MER network that enhances the detection of MEs by directing attention to areas with significant gradient changes. The core idea behind MEGANet is to combine gradient-guided attention with spatial self-attention, enabling the network to focus on both fine-grained motion transitions and the broader spatial context.
The proposed architecture consists of two key components as shown in Figure 3: the Gradient Attention Block and the Residual Attention Block. The Gradient Attention Block focuses on amplifying micro-intensity transitions by computing both horizontal and vertical gradients to identify regions with sharp intensity changes. This block generates an attention map through convolution and sigmoid activation, which is then multiplied with the input, enabling the network to prioritize areas with significant micro-movement. The Residual Attention Block, on the other hand, further refines the features by considering the spatial context, ensuring that important structural information is preserved during the feature extraction process. The overall network follows a structured pipeline comprising multiple processing layers:
- Input layer: The input to the network is an RGB image of size .
- Gradient Attention Block: Computes spatial gradients to enhance subtle ME features. A convolutional layer followed by a sigmoid activation generates an attention map, which is multiplied with the original input to highlight key regions.
- Convolutional Feature Extraction: A convolutional layer with 64 filters, followed by batch normalization, ReLU activation, and max pooling, extracts low-level spatial features from the input image.
- Residual-Attention Blocks: Three residual attention blocks process the feature maps hierarchically. Each block consists of two convolutional layers, batch normalization, ReLU activation, and a residual connection. A self-attention module is integrated to capture long-range spatial dependencies.
- Global Feature Aggregation: A global average pooling layer compresses the spatial feature maps into a compact feature vector, significantly reducing the number of parameters while retaining crucial information.
- Fully Connected Layer and Classification: The final feature vector is passed through a fully connected (FC) layer and a softmax activation function.
This architecture effectively captures ME dynamics by leveraging gradient-based attention and residual learning, improving the network’s ability to recognize subtle facial movements.
3.2.1 Gradient Attention Block
The motivation of gradient-based attention in MEGANet is closely related to the visual structure produced by MESTI. By construction, MESTI transforms a micro-expression video into a compact image in which subtle temporal motion is encoded as localized intensity transitions, especially around motion-relevant facial regions near the eyebrows, eyes, nose, and mouth. Therefore, the discriminative signal in MESTI is not only contained in global appearance, but also in weak local contrast changes and deformation boundaries. Since image gradients respond directly to such local intensity transitions, gradient-based attention is a natural choice for exploiting the motion-sensitive patterns exposed by MESTI. In this sense, the Gradient Attention Block serves as a representation-aware mechanism: it does not simply apply generic attention to the input, but specifically amplifies the subtle motion traces that MESTI is designed to highlight. This design makes MESTI and the Gradient Attention Block complementary: MESTI exposes subtle motion in image form, while gradient attention selectively enhances the most motion-informative regions within that representation.
This block, illustrated in Figure 4, explicitly models horizontal and vertical intensity gradients to localize ME regions. Given an input image , horizontal and vertical gradients at the spatial location are computed as:
[TABLE]
where and denotes zero-padded absolute differences. Combined gradient maps are generated through element-wise summation:
[TABLE]
The gradient map is then processed through a learnable 3x3 convolutional filter (), followed by sigmoid activation:
[TABLE]
The final output is obtained via element-wise multiplication:
[TABLE]
This attention map emphasizes regions with significant intensity transitions, which are critical for ME analysis. Figure 5 illustrates the gradient attention map produced by the Gradient Attention Block when our proposed MESTI is used as the input image.
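The four steps of this block (gradients, summation, convolution with sigmoid, element-wise multiplication) can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the authors' code: the input is a single-channel 2-D image, the gradients are forward differences with a zero-padded last column and row, and the 3x3 kernel is passed in as a fixed array instead of being learned. The function names are hypothetical.

```python
import numpy as np

def spatial_gradients(img):
    """Zero-padded absolute forward differences along x (columns) and y (rows)."""
    gx = np.zeros_like(img)
    gy = np.zeros_like(img)
    gx[:, :-1] = np.abs(img[:, 1:] - img[:, :-1])
    gy[:-1, :] = np.abs(img[1:, :] - img[:-1, :])
    return gx, gy

def conv3x3_same(x, kernel):
    """3x3 cross-correlation with zero padding, output has the shape of x."""
    padded = np.pad(x, 1)
    out = np.zeros_like(x)
    rows, cols = x.shape
    for di in range(3):
        for dj in range(3):
            out += kernel[di, dj] * padded[di:di + rows, dj:dj + cols]
    return out

def gradient_attention(img, kernel):
    """Return (attended image, attention map) for a 2-D image."""
    img = np.asarray(img, dtype=np.float64)
    gx, gy = spatial_gradients(img)
    combined = gx + gy                                  # element-wise summation
    attn = 1.0 / (1.0 + np.exp(-conv3x3_same(combined, kernel)))  # sigmoid
    return img * attn, attn

# Usage: a vertical step edge receives more attention than the flat area.
step = np.zeros((6, 6))
step[:, 3:] = 1.0
attended, attn = gradient_attention(step, kernel=np.ones((3, 3)))
```

With an all-ones kernel, flat regions receive an attention value of 0.5 (the sigmoid of zero), while positions next to the edge receive values closer to 1. A learned kernel would shape this response during training.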
3.2.2 Residual - Attention Block
Our Residual - Attention Block is illustrated in Figure 6. Building upon residual connections and SAGAN’s self-attention [zhang2019selfattentiongenerativeadversarialnetworks], this block integrates self-attention into a residual framework to enhance spatial context modeling. Let denote the transformation by two convolutional layers:
[TABLE]
A shortcut connection handles dimension mismatches:
[TABLE]
The residual output becomes:
[TABLE]
The residual output is then passed to the Self-Attention Module proposed in SAGAN, illustrated in Figure 7. Specifically:
[TABLE]
Finally:
[TABLE]
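As a rough illustration of the self-attention step, the sketch below applies a SAGAN-style attention over all spatial positions of a feature map and adds the result back to the input with a scalar weight. It is a simplified sketch, not the authors' implementation: the 1x1 convolutions are written as matrix products, the weight names `w_q`, `w_k`, `w_v` and the scalar `gamma` are hypothetical, and the two convolutional layers and shortcut of the residual branch are assumed to have been applied already.

```python
import numpy as np

def softmax(z, axis):
    """Numerically stable softmax along one axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x, w_q, w_k, w_v, gamma):
    """x: (C, H, W) features; w_q, w_k: (C_r, C); w_v: (C, C); gamma: scalar."""
    c, h, w = x.shape
    flat = x.reshape(c, h * w)              # (C, N) with N = H * W positions
    q = w_q @ flat                          # queries, (C_r, N)
    k = w_k @ flat                          # keys,    (C_r, N)
    v = w_v @ flat                          # values,  (C, N)
    attn = softmax(q.T @ k, axis=-1)        # (N, N): row i weights all positions
    out = v @ attn.T                        # (C, N): attention-weighted values
    return (gamma * out + flat).reshape(c, h, w), attn

# Usage on a random 4-channel 3x3 feature map.
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3, 3))
w_q, w_k = rng.normal(size=(2, 4)), rng.normal(size=(2, 4))
w_v = rng.normal(size=(4, 4))
y_zero, attn = self_attention(x, w_q, w_k, w_v, gamma=0.0)
y_half, _ = self_attention(x, w_q, w_k, w_v, gamma=0.5)
```

With `gamma = 0` the block reduces to the identity, so the attention branch can be introduced gradually as `gamma` is learned.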
4 EXPERIMENTS AND RESULTS
4.1 Experiment scenarios and objectives
To evaluate the effectiveness of the proposed method for MER, which includes MESTI as the input representation, MEGANet as the MER network, and the combined approach of MESTI and MEGANet, four experimental scenarios were conducted to assess the performance of each proposed component:
Experiment 01: This experiment aims to evaluate the effectiveness of the MESTI input representation. Specifically, it compares MESTI with other input modalities previously used in MER studies, such as Apex Frame, Optical Flow, Dynamic Image, Active Image, and Affective Image. Furthermore, the experiment continues by replacing the input in previously published MER networks with MESTI to investigate whether MESTI improves MER performance in these prior works.
Experiment 02: This experiment evaluates the performance of MEGANet in MER. It also analyzes the effectiveness of the key blocks proposed in MEGANet.
Experiment 03: This experiment assesses the overall effectiveness of the proposed MER method, combining MESTI and MEGANet. The results of this experiment are compared with recent SOTA methods to demonstrate the superiority of the proposed approach.
Experiment 04: This experiment assesses the generalization of the proposed MER method. The results of this experiment are compared with recent SOTA methods in cross-dataset protocol to demonstrate the superiority of the proposed approach.
4.2 Dataset & Data preprocessing
4.2.1 Datasets
The experiments are conducted on three publicly available ME recognition datasets, namely SMIC-HS[6553717], CASME II[Yan2014-fa] and SAMM[7492264], which are widely used as standard benchmarks for MER and for comparison with previous studies.
4.2.2 Data preprocessing
To ensure a fair comparison, the data preprocessing steps are kept minimal: face cropping, histogram equalization, and resizing the images to the dimensions required by each network’s input. This minimal approach eliminates potential biases from complex preprocessing techniques, allowing us to isolate and highlight the contribution of each input modality to overall network performance.
To ensure fair comparison with prior studies, this work conducts both 3-class and 5-class evaluations. In the 3-class evaluation, three common categories across the datasets are considered: positive, negative, and surprise. For the 5-class evaluation, the original emotion annotations provided in the CASME II and SAMM datasets are retained. Specifically, the CASME II dataset comprises the ME labels disgust (60 samples), happiness (33), other (102), repression (27), and surprise (25). The SAMM dataset includes the labels anger (57 samples), happiness (26), contempt (12), surprise (15), and other (49).
4.3 Experimental settings
4.3.1 Experiment 01
To ensure a fair comparison of the effectiveness of all input modalities, a common procedure is applied to all modalities. A train-test split protocol is used, with 90% for training and 10% for testing. The input is sequentially fed into three widely recognized deep learning networks: VGG19, ResNet50, and EfficientNetB0. This standardized approach minimizes external factors that could influence performance outcomes, allowing the observed differences to be directly attributed to the input modality itself.
MESTI is further used as an alternative input for the MER networks employed in two prior studies. To ensure fairness and the significance of the comparison results, we implement the experimental method in the same manner as described in their studies. Both studies used the Leave-One-Subject-Out (LOSO) protocol for evaluation.
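The LOSO protocol holds out all samples of one subject for testing and trains on the remaining subjects, repeating this once per subject. A minimal sketch of such a split generator is given below; the function name `loso_splits` and the subject identifiers are hypothetical and only serve the example.

```python
import numpy as np

def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    subject_ids = np.asarray(subject_ids)
    for subject in np.unique(subject_ids):
        test_mask = subject_ids == subject
        yield subject, np.flatnonzero(~test_mask), np.flatnonzero(test_mask)

# Usage: six samples from three subjects give three folds.
subjects = ["s01", "s01", "s02", "s03", "s03", "s03"]
folds = list(loso_splits(subjects))
for held_out, train_idx, test_idx in folds:
    print(held_out, train_idx.tolist(), test_idx.tolist())
```

Since no subject appears on both sides of a fold, the reported scores reflect generalization to unseen identities.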
4.3.2 Experiment 02
MEGANet is evaluated through an ablation study with two scenarios. In the first, the individual components of MEGANet (the Gradient Attention Block and the Residual Attention Block) are isolated and evaluated independently. In the second, the performance of MEGANet is assessed for varying numbers of Residual Attention Blocks to determine the most suitable configuration. Both ablations use the LOSO protocol for evaluation.
4.3.3 Experiment 03
This experiment is designed to evaluate the proposed method in this paper, using MESTI as the input and MEGANet as the MER network. The experiment is conducted using the LOSO protocol for evaluation to ensure a meaningful comparison with previously published methods.
4.3.4 Experiment 04
This experiment is designed to evaluate the proposed method under a cross-dataset protocol, using MESTI as the input and MEGANet as the MER network. The model is trained on CASME II or SAMM and tested on the SMIC dataset, and performance is reported with the ACC and WF1 metrics.
4.3.5 Specific configuration and training methodology
The following configuration and training methodology were used in this study:
- Data Augmentation: The dataset is augmented using horizontal flipping and rotations at 5° and 10° (both clockwise and counterclockwise).
- Loss Function: Focal Loss was used to address class imbalance and improve the network’s focus on hard-to-classify samples.
- Optimizer: The Adam optimizer was employed with a learning rate of and weight decay of to optimize the network; no learning-rate scheduler was used.
- Training Duration: The network was trained for 50 epochs.
- Metric: The primary evaluation metrics are UF1 and UAR; accuracy is reported as a supplementary measure. WF1 is used in the cross-dataset evaluation, following previous methods [shao2025mol].
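For reference, the three class-sensitive metrics named in the list above can be computed from per-class counts as sketched below. The sketch assumes the usual definitions: UF1 is the unweighted mean of per-class F1 scores, UAR is the unweighted mean of per-class recalls, and WF1 is the per-class F1 weighted by class support. The function name and the toy labels are hypothetical, and in LOSO evaluation the counts are commonly pooled over all folds before the scores are computed.

```python
import numpy as np

def uf1_uar_wf1(y_true, y_pred, num_classes):
    """Return (UF1, UAR, WF1) from integer class labels."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    classes = range(num_classes)
    tp = np.array([np.sum((y_true == c) & (y_pred == c)) for c in classes], dtype=float)
    fp = np.array([np.sum((y_true != c) & (y_pred == c)) for c in classes], dtype=float)
    fn = np.array([np.sum((y_true == c) & (y_pred != c)) for c in classes], dtype=float)
    support = tp + fn                       # number of true samples per class
    denom = 2 * tp + fp + fn
    f1 = np.divide(2 * tp, denom, out=np.zeros_like(tp), where=denom > 0)
    recall = np.divide(tp, support, out=np.zeros_like(tp), where=support > 0)
    uf1 = f1.mean()                         # every class counts equally
    uar = recall.mean()
    wf1 = np.sum(f1 * support) / support.sum()
    return uf1, uar, wf1

# Usage on a small imbalanced toy example with three classes.
y_true = [0, 0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 0, 1, 1, 2, 2]
uf1, uar, wf1 = uf1_uar_wf1(y_true, y_pred, num_classes=3)
```

In this toy case the majority class pulls WF1 above UF1, which shows why the unweighted scores are the stricter choice for imbalanced MER data.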
4.4 Results
4.4.1 Visual representation
The visual results of MESTI and its corresponding Gradient Attention Map are shown in Figure 8 to observe how MESTI captures the characteristic features of each ME emotion type and how the Gradient Attention Map highlights the regions of interest within MESTI. A key observation is that MESTI effectively captures and highlights the defining motion patterns of MEs, making them perceptible to the human eye in a single image representation.
More specifically, both MESTI and the Gradient Attention Map successfully depict the characteristic Action Units corresponding to different ME emotions. For Disgust, the key motion regions primarily appear around the eyebrows, one side of the nose, and the corners of the mouth. Repression manifests as subtle downward movements on both sides of the mouth and the chin. Happiness is expressed by an upward motion at the corners of the mouth, whereas Surprise is predominantly reflected in eyebrow elevation and lower lip movement. These findings highlight the capability of MESTI to encode motion dynamics effectively in a compact and visually interpretable format.
4.4.2 MESTI Representation compared with other input modalities
Table 1 summarizes the comparative performance of various input modalities in the ME recognition task, evaluated using deep learning networks on the CASMEII and SAMM datasets. The results consistently demonstrate that MESTI outperforms all other input modalities across the three widely used CNN architectures: VGG19, ResNet50, and EfficientNetB0. Specifically, MESTI achieves the highest accuracy of 73.08% on CASMEII and 62.5% on SAMM with VGG19, surpassing the second-best input modality (Dynamic Image) by 15.39 and 12.5 percentage points, respectively. This superior performance underscores MESTI’s effectiveness in capturing subtle motion features, which are crucial for ME recognition.
For the SAMM dataset, the overall recognition performance is lower compared to CASMEII across all input modalities, a trend consistent with previous studies due to SAMM’s greater diversity and complexity. Despite this challenge, MESTI continues to demonstrate superior recognition capabilities, achieving 62.5% with VGG19 and 56.25% with ResNet50, reinforcing its robustness across different datasets and deep learning architectures.
To further validate MESTI’s effectiveness, we investigated whether its superior performance was specific to our proposed pipeline or whether it could also enhance other established MER architectures. The original input modalities of two previously published works are replaced with MESTI: VGG19 (originally using Dynamic Image) and Micro-Attention (originally using Apex Frame). The results, presented in Table 2, show that for VGG19, replacing the input with MESTI improved recognition accuracy from 51.02% to 69.39% on CASMEII and from 43.23% to 60.29% on SAMM. Similarly, for Micro-Attention, using MESTI as input improved accuracy from 65.90% to 71.02% on CASMEII and from 48.5% to 63.24% on SAMM. These results confirm that MESTI not only enhances our proposed network but also significantly improves the performance of other MER architectures, demonstrating its capability to effectively represent ME dynamics in a single image.
4.4.3 Compared with State-of-the-art methods in MER
The comparative results with recent state-of-the-art methods are reported in Table 3 (for the 3-class evaluation), Table 4 (for the 5-class evaluation) and Table 5 (for cross-dataset evaluation). Overall, the proposed method outperforms existing SOTA approaches on the SAMM and SMIC-HS datasets and achieves competitive performance on the CASME II dataset, as reflected across all three evaluation metrics: accuracy, UF1, and UAR.
**3-class evaluation**
Overall, the proposed MESTI-MEGANet demonstrates strong and balanced performance across the three benchmark datasets, achieving the best results on SMIC-HS and SAMM, while remaining competitive on CASME II.
On SMIC-HS, the proposed method achieves the best performance across all three reported metrics, with UF1 = 0.917, UAR = 0.924, and ACC = 92.68%. Compared with the strongest competing method in this table, SODA4MER, which reports UF1 = 0.886 and UAR = 0.888, our method improves by +0.031 in UF1 and +0.036 in UAR. These results indicate that the proposed framework is particularly effective on SMIC-HS, even though this dataset does not provide apex-frame annotations.
On SAMM, MESTI-MEGANet again ranks first on all three metrics, reaching UF1 = 0.890, UAR = 0.914, and ACC = 0.918. Since several competing methods do not report results on SAMM, the comparison is limited to the available entries in Table 3; nevertheless, among the reported methods, our approach shows the strongest overall performance. In particular, it surpasses FRL-DGT (UF1 = 0.772, UAR = 0.758) and Dual-ATME (UF1 = 0.562, UAR = 0.538, ACC = 0.714) by a clear margin, demonstrating the effectiveness of the proposed representation and architecture on this challenging dataset.
On CASME II, the proposed method remains competitive, obtaining UF1 = 0.913, UAR = 0.929, and ACC = 0.932. The best performance on this dataset is achieved by MERASTC, with UF1 = 0.933 and UAR = 0.950, while FRL-DGT reports UF1 = 0.919 and UAR = 0.903. Therefore, although our method does not rank first on CASME II, the performance gap to the strongest reported methods remains small, indicating that the proposed framework generalizes well across different benchmark conditions.
Taken together, the results in Table 3 show that MESTI-MEGANet achieves state-of-the-art performance on SMIC-HS and SAMM in the 3-class setting, while maintaining highly competitive performance on CASME II. This suggests that the combination of MESTI, which provides an apex-centered motion representation, and MEGANet, which explicitly enhances subtle motion-sensitive regions, is effective across datasets with different characteristics and recording conditions.
**5-class evaluation**
On CASMEII, SODA4MER yields the best UF1 and ACC (UF1=0.814, ACC=84.18%). MESTI-MEGANet reaches UF1=0.779, UAR=0.786, ACC=82.04 and matches JGULF on accuracy (82.04), trailing SODA4MER by 2.14. On SAMM, MESTI-MEGANet attains the top results across all reported metrics (UF1=0.791, UAR=0.803, ACC=80.88). The gains are small but consistent: UF1 is slightly higher than SODA4MER (0.789), and accuracy exceeds JGULF (80.71) by 0.17 and SODA4MER (80.30) by 0.58. Note that UAR on SAMM is not commonly reported by most baselines, so direct UAR comparisons are limited.
Across protocols and datasets, MESTI-MEGANet delivers SOTA results on SMIC-HS (3-class) and on SAMM (both 3-class and 5-class), while remaining competitive on CASMEII (second-best accuracy in 3-class; tied with JGULF on accuracy but below the best UF1 in 5-class). These outcomes indicate that the method generalizes well to different datasets and label granularities, with the largest margins observed on SMIC-HS, a dataset without apex-frame annotations.
This success is attributed to two key factors:
- MESTI’s motion-specific encoding, which preserves spatiotemporal dynamics (Figure 2)
- MEGANet’s Gradient Attention mechanism, which focuses on intensity transitions (Figure 5) while Residual Attention blocks model long-range dependencies (Figure 6).
**Cross-dataset evaluation**
To further assess the robustness of the proposed method under domain shift, we additionally evaluate MESTI-MEGANet in a cross-dataset setting and compare it with representative state-of-the-art methods, as shown in Table 5. In this protocol, the model is trained on one dataset and tested on another, which is substantially more challenging than within-dataset LOSO evaluation due to differences in subjects, recording conditions, elicitation procedures, and data distributions. Following prior cross-dataset MER studies, ACC and WF1 are reported here to enable fair comparison with published baselines.
The results show that the proposed method achieves the best performance in all reported transfer settings. Specifically, when trained on CASME II and tested on SMIC, our method reaches 50.00% ACC and 46.81% WF1, outperforming the previous best method MOL (47.13% ACC, 43.91% WF1) by +2.87 and +2.90 percentage points, respectively. Similarly, in the more challenging SAMM → SMIC transfer, MESTI-MEGANet achieves 46.95% ACC and 40.89% WF1, again surpassing MOL (44.58% ACC, 32.32% WF1) by +2.37 points in accuracy and a larger margin of +8.57 points in WF1. These improvements are particularly meaningful because WF1 is more informative in imbalanced settings and indicates that the gain is not limited to majority-class prediction.
Overall, these cross-dataset results suggest that the proposed framework is not only effective under standard within-dataset evaluation, but also exhibits improved robustness when the training and testing distributions differ. We attribute this behavior to two complementary factors. First, MESTI provides a compact motion-centered representation that preserves discriminative micro-expression dynamics while reducing sensitivity to raw-frame appearance variation. Second, MEGANet, especially its Gradient Attention Block, helps focus the model on motion-sensitive facial patterns that are more likely to transfer across datasets than dataset-specific appearance cues. Although the absolute performance in cross-dataset evaluation remains lower than LOSO results, which is expected in MER, the consistent gains over prior methods indicate stronger generalization ability under domain shift.
4.4.4 The dependency of MESTI on apex frame
In the SMIC dataset, apex frame annotations are not provided; hence, the apex frame information cannot be directly utilized to construct MESTI. As an alternative, in this study we adopt a simple strategy of selecting the middle frame.
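The middle-frame strategy can be written as a small fallback rule. The sketch below is illustrative: `choose_apex` and its arguments are hypothetical names, and taking the integer midpoint of the onset and offset indices is one plausible reading of "the middle frame".

```python
def choose_apex(onset, offset, annotated_apex=None):
    """Return the annotated apex index if given, else the middle frame index."""
    if annotated_apex is not None:
        if not onset <= annotated_apex <= offset:
            raise ValueError("apex must lie between onset and offset")
        return annotated_apex
    return (onset + offset) // 2    # assumption: integer midpoint of the clip

# Usage: a clip with an annotation and a clip without one.
with_label = choose_apex(onset=10, offset=30, annotated_apex=14)
without_label = choose_apex(onset=10, offset=30)
```

The returned index then plays the role of the apex when the ranking coefficients are built.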
Interestingly, the results on SMIC demonstrate strong performance despite the absence of apex frame supervision. This finding suggests that the proposed MESTI approach can remain effective even without precise apex frame information, highlighting its robustness and applicability in more challenging scenarios where apex annotations are unavailable.
We further extend our experiments to analyze performance under noisy-apex or apex-absent conditions, evaluated within a cross-dataset setting. This experiment serves two objectives: first, to compare the impact of using precisely labeled apex frames versus substituting them with middle frames; and second, to compare different input modalities in order to assess MESTI’s performance in cross-dataset scenarios. The results are recorded in Table 9.
Table 9 provides two complementary observations. First, the performance of MESTI is indeed affected by the choice of apex location, confirming that apex-centered temporal encoding is relevant to the quality of the representation. However, the degradation remains relatively limited when the exact apex is replaced by an approximate one, indicating that MESTI is not overly sensitive to small or moderate inaccuracies in apex localization. In other words, while precise apex information is beneficial, the proposed representation does not collapse when the apex is noisy or unavailable.
Second, even under the apex-free setting, MESTI remains competitive with, and in several cases superior to, the other input modalities. This suggests that the advantage of MESTI does not rely solely on having perfectly annotated apex frames, but also comes from its ability to convert the facial motion sequence into an apex-centered spatio-temporal representation that still preserves discriminative motion structure under approximate temporal centering.
These results further support the practical robustness of the proposed method. In real-world MER scenarios, accurate apex annotations may be unavailable or difficult to estimate reliably. The findings in Table 9 suggest that MESTI can still provide an effective representation under such conditions, with only a limited performance drop relative to the exact-apex case. Therefore, the proposed formulation is not restricted to ideal benchmark settings, but remains applicable in more realistic apex-free or noisy-apex situations.
4.5 Ablation study
Ablation on network architecture. To evaluate the contribution of each component in MEGANet, we conducted an ablation study focusing on both the key building blocks and the number of Residual Attention Blocks.
As shown in Table 6, removing any of the major components leads to a clear performance drop. Using only the Residual Block without Gradient Attention or Self-attention yields the lowest performance (UF1 = 0.746, UAR = 0.775, ACC = 75.61). Incorporating Self-attention alone provides some improvement (ACC = 83.54), while Gradient Attention Block alone achieves ACC = 81.10. The best performance is obtained when all three modules are integrated, resulting in significant gains (UF1 = 0.917, UAR = 0.924, ACC = 92.68). This demonstrates that the Gradient Attention Block, Residual Block, and Self-attention contribute complementary benefits, and their combination is essential for maximizing recognition accuracy.
Table 7 further analyzes the effect of network depth by varying the number of Residual Attention Blocks. The results show that performance improves markedly when the depth increases from two to three blocks (ACC/UF1/UAR rise from 82.32/0.802/0.842 to 92.68/0.917/0.924), but decreases when a fourth block is added (85.98/0.844/0.878 in ACC/UF1/UAR). We interpret this behavior as a trade-off between representational capacity and generalization. With only two blocks, the network may not be sufficiently deep to progressively refine the subtle motion cues emphasized by the Gradient Attention Block and to model higher-level spatial dependencies. Three blocks provide a stronger hierarchical refinement process and yield the best overall performance. However, increasing the depth further to four blocks introduces additional model complexity, which is less suitable for MER due to the limited size and high subject variability of current datasets. In this setting, the additional block is more likely to introduce redundancy or overfitting than further discriminative benefit. Therefore, three Residual Attention Blocks provide the most effective balance between feature refinement and robustness in our experiments.
Ablation on attention mechanism. To further validate the effectiveness of the proposed Gradient Attention Block, we compare it with several representative attention alternatives, including no attention, SE, and CBAM, under the same cross-dataset evaluation setting. The results are summarized in Table 8.
The proposed Gradient Attention consistently achieves the best performance across both transfer scenarios. On CASME II → SMIC, it obtains 50.00% ACC and 46.81% WF1, outperforming the no-attention baseline (45.12% ACC, 44.67% WF1), SE (44.51% ACC, 42.10% WF1), and CBAM (45.73% ACC, 43.15% WF1). On SAMM → SMIC, Gradient Attention again ranks first with 46.95% ACC and 40.89% WF1, compared with 39.63/39.97 for no attention, 43.90/40.74 for SE, and 41.46/39.28 for CBAM. These results indicate that the gain is not simply due to adding an arbitrary attention module, but is specifically related to the proposed gradient-driven formulation.
A plausible explanation is that generic attention mechanisms such as SE and CBAM mainly reweight channels or spatial regions from feature activations, whereas the proposed Gradient Attention Block derives the attention cue directly from local intensity transitions. This is particularly suitable for MER, because the discriminative signal in MESTI is expressed as subtle motion-induced contrast changes rather than strong semantic structures. Therefore, Gradient Attention provides a more effective inductive bias for highlighting motion-relevant regions, especially in the cross-dataset setting where robustness to appearance variation is critical.
Another notable observation is that the advantage of Gradient Attention becomes more pronounced in the SAMM → SMIC setting, where the domain gap is larger. This suggests that emphasizing gradient-based motion traces may help the model rely less on dataset-specific appearance patterns and more on transferable motion-sensitive cues. These findings support the claim that the proposed block is not merely a generic attention add-on, but a representation-aware mechanism that is well matched to the motion structure encoded by MESTI.
4.6 Discussion and limitations
Class imbalance. MER datasets are inherently imbalanced, especially in the 5-class setting, as reflected by the label distributions of CASME II and SAMM reported in Section 4.2.2. In this work, class imbalance was handled at multiple levels. At the optimization level, we used Focal Loss to reduce the dominance of easy majority-class samples and encourage the model to focus more on difficult and minority-class examples. At the evaluation level, we emphasized UF1 and UAR in the main comparative and ablation experiments, since these metrics are more informative than accuracy for imbalanced MER datasets. In addition, moderate data augmentation was used to improve robustness and reduce overfitting to dominant classes.
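The down-weighting effect of Focal Loss on easy samples can be illustrated with a short NumPy sketch of the standard multi-class form, in which the cross-entropy of each sample is scaled by (1 - p_t)^gamma, where p_t is the predicted probability of the true class. The focusing parameter used in this work is not stated, so `gamma = 2.0` below is an assumption, as are the function name and the toy logits; no per-class weighting term is included.

```python
import numpy as np

def focal_loss(logits, targets, gamma=2.0):
    """Mean multi-class focal loss; gamma = 0 recovers plain cross-entropy."""
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets)
    shifted = logits - logits.max(axis=1, keepdims=True)       # stable log-softmax
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    log_pt = log_probs[np.arange(len(targets)), targets]       # log p of true class
    pt = np.exp(log_pt)
    return float(np.mean(-((1.0 - pt) ** gamma) * log_pt))

# Usage: one confidently correct sample and one confidently wrong sample.
easy = ([[4.0, 0.0, 0.0]], [0])
hard = ([[0.0, 4.0, 0.0]], [0])
easy_ratio = focal_loss(*easy) / focal_loss(*easy, gamma=0.0)
hard_ratio = focal_loss(*hard) / focal_loss(*hard, gamma=0.0)
```

The ratio of focal loss to cross-entropy is far smaller for the easy sample than for the hard one, so well-classified majority-class samples contribute little to the gradient.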
Dependency on apex-frame estimation. MESTI uses apex information to define the center of its apex-oriented ranking formulation. However, the method is not strictly dependent on perfectly accurate apex detection. Its goal is to concentrate representational emphasis around the most informative phase of the micro-expression rather than to require exact frame-level precision. This property is partially supported by the SMIC-HS setting, where apex annotations are unavailable and a simple middle-frame approximation is used, yet the proposed framework still achieves strong performance. This suggests that MESTI can remain effective when the apex is unavailable or only approximately estimated. Nevertheless, large errors in apex localization may shift the temporal weighting away from the truly discriminative phase and thus weaken the resulting representation.
Failure cases and limitations. Despite its effectiveness, the proposed framework still has several limitations. First, MESTI compresses an entire video into a single image, which improves compactness but may lose some fine-grained temporal ordering information. Second, the quality of the representation may degrade when the estimated apex is substantially inaccurate. Third, strong head motion, motion blur, occlusion, or illumination variation may still interfere with the encoding of subtle facial motion, especially when the ME itself is extremely weak. Fourth, due to the small scale and high subject variability of current MER datasets, deeper or more complex architectures can easily overfit, which is also consistent with our observation that increasing the number of Residual Attention Blocks beyond three does not improve performance.
5 CONCLUSION
In this work, we address the limitations of existing MER methodologies by introducing ME Spatio-Temporal Image as a novel input modality and ME Gradient Attention Network as a novel architecture. MESTI effectively encodes micro-movements into a single image, preserving both spatial and temporal features, while MEGANet utilizes a Gradient Attention mechanism to enhance the detection of subtle motion cues.
Our experimental results validate the effectiveness of MESTI by showing that it outperforms all other input modalities, including Apex Frame, Optical Flow, and Dynamic Image, across multiple deep learning networks. Furthermore, replacing the input of previously published MER architectures with MESTI results in significant improvements in recognition accuracy, highlighting its broad applicability. Additionally, MEGANet achieves state-of-the-art performance, particularly when combined with MESTI, confirming its effectiveness in ME analysis. These findings establish MESTI and MEGANet as highly effective solutions for MER, significantly improving recognition accuracy. Future work could explore refining MESTI for real-time applications, integrating additional attention mechanisms, or leveraging larger-scale datasets to further advance ME recognition systems.
References
