EEG–fNIRS Cross-Subject Emotion Recognition Based on Attention Graph Isomorphism Network and Contrastive Learning
Bingzhen Yu, Xueying Zhang, Guijun Chen

TL;DR
This paper introduces a new method for emotion recognition using EEG and fNIRS data that improves accuracy and generalization across subjects.
Contribution
The novel DC-AGIN model combines attention graph isomorphism networks with contrastive learning to enhance cross-subject emotion recognition.
Findings
DC-AGIN achieves 96.98% accuracy in subject-dependent four-class emotion classification.
The model reaches 62.56% accuracy under subject-independent leave-one-subject-out validation.
DC-AGIN outperforms existing models in cross-subject emotion recognition tasks.
Abstract
Background/Objectives: Electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS) can objectively capture the spatiotemporal dynamics of brain activity during affective cognition, and their combination is promising for improving emotion recognition. However, multi-modal cross-subject emotion recognition remains challenging due to heterogeneous signal characteristics that hinder effective fusion and substantial inter-subject variability that degrades generalization to unseen subjects. Methods: To address these issues, this paper proposes DC-AGIN, a dual-contrastive learning attention graph isomorphism network for EEG–fNIRS emotion recognition. DC-AGIN employs an attention-weighted AGIN encoder to adaptively emphasize informative brain-region topology while suppressing redundant connectivity noise. For cross-modal fusion, a cross-modal contrastive learning module…
Funding
- National Natural Science Foundation of China
- Fundamental Research Program of Shanxi Province, China
Taxonomy
Topics: Emotion and Mood Recognition · Optical Imaging and Spectroscopy Techniques · EEG and Brain-Computer Interfaces
1. Introduction
Emotion is a core component of human cognition and decision making. It not only shapes individuals’ subjective well-being and social interactions but also deeply participates in higher-order cognitive processes such as attention, memory, and value-based judgment [1]. With the rapid development of affective computing [2] and affective brain–computer interfaces [3], accurately recognizing an individual’s affective state in complex real-world scenarios has become a fundamental problem for intelligent human–computer interaction and mental-health monitoring. Traditional emotion recognition methods are often built on non-physiological cues—such as facial expressions, speech prosody, behavior, and text—which are explicitly observable but inherently subjective and vulnerable to intentional masking. In contrast, physiology-based emotion recognition relies on autonomous or involuntary physiological responses, including electrocardiography (ECG), galvanic skin response (GSR), and central nervous system signals such as EEG and fNIRS [4,5,6]. These signals can partially bypass subjective concealment and more objectively reflect changes in emotional arousal and valence, making them particularly suitable for reliability-critical applications such as mental-health assessment and brain-computer interfaces [7].
Early studies on physiological-signal-based emotion recognition predominantly focused on a single modality. Among various physiological signals, EEG provides millisecond-level temporal resolution and has been widely used to characterize transient neural oscillatory activity under affect elicitation. Complementarily, fNIRS measures concentration changes in oxy-hemoglobin (HbO) and deoxy-hemoglobin (HbR) to indirectly reflect local cortical blood flow and metabolic activity; although its temporal resolution is lower, it offers superior spatial localization. Under the mechanism of neurovascular coupling, EEG and fNIRS provide two complementary observation views of the same underlying neural activity, and both can be acquired using portable wearable devices [8]. Therefore, compared with either modality alone, EEG–fNIRS multi-modal fusion is expected to achieve substantial gains in accuracy and robustness in emotion recognition.
In recent years, graph neural networks (GNNs) have emerged as a class of deep learning methods for non-Euclidean data, learning representations on graph-structured inputs primarily via message passing and neighborhood aggregation, and have been increasingly applied to emotion recognition and related tasks [9]. EEG and fNIRS can be naturally represented as graphs by treating measurement channels as nodes and functional connectivity as edges, enabling the mining of emotion-related network features at the node, edge, and graph levels, which also improves interpretability [10]. Jia et al. [11] proposed a channel-relation graph convolutional network (CR-GCN) and achieved classification accuracy rates of 94.69% and 93.95% for valence and arousal on the DEAP dataset using EEG. Li et al. [12] introduced an attention-based temporal graph representation network (ATGRNet) and obtained 92.59% accuracy on the SEED dataset with EEG signals. For EEG–fNIRS fusion-based emotion recognition, Chen et al. [13] developed a model combining GCNs with a capsule-attention network (GCN-CA-CapsNet) and reported 97.91% accuracy, while Hou et al. [14] constructed causal-coupled brain networks in source space and employed SVM classification, achieving 96.6% accuracy.
Despite the impressive performance reported under controlled laboratory settings, cross-subject emotion recognition for practical deployment still faces substantial challenges. First, inter-individual variability is a primary bottleneck for generalization. Due to inherent differences in anatomical structures, baseline physiological states, and affective response patterns, models trained with conventional cross-entropy losses often exhibit significant performance degradation on unseen subjects. Many high-accuracy approaches rely on subject-dependent training paradigms, which makes it difficult to satisfy the plug-and-play requirement of real-world BCIs. For example, Chang et al. [15] proposed a multi-scale hyperbolic contrastive learning network (MSHCL) and achieved 89.3% cross-subject accuracy on the SEED dataset. Si et al. [16] designed a dual-branch joint network (DBJNet) and reached 74.8% cross-subject three-class accuracy on a self-built fNIRS database. These studies suggest that cross-subject performance remains markedly lower than subject-dependent performance, largely because EEG feature distributions exhibit substantial domain shift across subjects, making it difficult for classifiers to learn universal affective representations [17]. Second, semantic alignment between heterogeneous modalities is still insufficient. Although EEG and fNIRS are physiologically related, their feature distributions lie on fundamentally different manifolds. Existing multi-modal fusion methods are often limited to data-level concatenation or shallow feature-level fusion; such straightforward combinations typically fail to eliminate cross-modal heterogeneity, leading to inefficient exploitation of complementary information [18].
Third, conventional GNNs often adopt isotropic aggregation when integrating neighborhood information and lack attention mechanisms that dynamically adjust inter-regional connection weights for emotion-related tasks, thereby limiting the ability to capture key affective brain networks [19]. Inspired by recent advances in computer vision and graph representation learning [20,21,22,23], we introduce contrastive learning into EEG–fNIRS fusion-based emotion recognition to better represent and relate the two signals.
Accordingly, we propose a dual-contrastive cross-modal attentional graph isomorphism network (DC-AGIN), which aims to achieve deep disentanglement and alignment of representations via two complementary contrastive learning mechanisms: cross-modal contrastive learning and cross-subject supervised contrastive learning. Architecturally, an attention graph isomorphism network (AGIN) is employed as the encoder to enhance the model’s capability to focus on critical brain network topological connections. From the learning-strategy perspective, cross-modal contrastive learning maps heterogeneous EEG and fNIRS signals into a shared semantic subspace to maximize mutual information across modalities; meanwhile, cross-subject supervised contrastive learning explicitly suppresses subject-specific identity information and encourages the encoder to concentrate on intrinsic emotion-class characteristics, thereby substantially improving generalization to unseen subjects. Our main contributions are summarized as follows:
- 1. We integrate a graph attention mechanism into the graph isomorphism network to improve neighborhood aggregation by dynamically assigning weights according to the importance of each neighboring node.
- 2. We leverage cross-modal contrastive learning to align the semantic information between EEG and fNIRS, thereby mitigating cross-modal heterogeneity. In addition, cross-subject contrastive learning is employed on the fused EEG–fNIRS representations to facilitate cross-subject emotion recognition and reduce inter-subject discrepancies in affective patterns.
- 3. Extensive experiments on our self-collected EEG and fNIRS dataset demonstrate that the proposed method achieves state-of-the-art performance.
2. Data and Pre-Built Model
2.1. Data Acquisition and Preprocessing
2.1.1. Subjects and Data Collection
A total of 30 healthy participants (15 males and 15 females) were recruited, with a mean age of years. All participants had normal (or corrected-to-normal) vision and normal hearing, were right-handed, and had no history of neurological or psychiatric disorders. Before each experiment, participants were fully informed of the study objectives and procedures. Written informed consent was obtained from all participants, and each participant received monetary compensation after completing the experiment. The study protocol complied with the Declaration of Helsinki and was approved by the Research Ethics Committee of Taiyuan University of Technology. For data acquisition, whole-brain EEG signals were recorded using an ESI-NeuroScan system with 62 channels at a sampling rate of 1000 Hz. Simultaneously, fNIRS signals were synchronously collected using a portable NirsSmart fNIRS system at a sampling rate of 11 Hz. The optode layout covered the frontal and temporal regions, comprising 18 measurement channels.
2.1.2. Experimental Paradigm
All experimental data were collected in an electromagnetically shielded room, and the acquisition workflow is illustrated in Figure 1a. During the experiment, participants sat comfortably in front of a monitor and were instructed to remain as still as possible, minimizing body movements and eye blinks to reduce artifact contamination. A video-elicitation paradigm was adopted. In total, 60 video clips (1–2 min each) were selected as affective stimuli, covering four emotional states: sadness, happiness, neutral, and fear (15 clips per category). The video clips were pre-evaluated in terms of arousal and valence based on the Self-Assessment Manikin (SAM) scores provided by 20 individuals who did not participate in the main experiment, and the final affective video stimuli were selected according to these ratings. After each clip, participants were required to complete a self-assessment questionnaire within 30 s to verify whether the target emotion was successfully elicited. Each participant completed 60 experimental trials. After the experiment, the system recorded the corresponding 62-channel EEG data and 18-channel fNIRS data; the spatial distribution of measurement channels is shown in Figure 1b.
2.1.3. Data Preprocessing
Given that raw EEG signals are susceptible to contamination from electrooculography (EOG), electromyography (EMG), and various noise sources, EEG preprocessing was performed using the EEGLAB v2023.0 toolbox. Specifically, the bilateral mastoids (M1 and M2) were used as re-referencing electrodes. A Hamming-windowed sinc FIR filter was applied for band-pass filtering in the range of 0.5–45 Hz. To remove signal drift, baseline correction was performed by subtracting the mean value within the 2 s pre-stimulus interval. Independent component analysis (ICA) was then employed to remove blink/ocular and myogenic artifact components; blink/ocular components typically exhibit prominent frontal weights on the scalp topography or abrupt large-amplitude fluctuations in the time course, whereas EMG artifacts are often characterized by elevated high-frequency energy in the waveform. These operations were used solely to denoise the signals and did not involve discarding any data segments. The EEG data were subsequently downsampled to 200 Hz and segmented into 1 s samples. For fNIRS signals, the same 2 s baseline correction was first conducted, and a 0.01–0.2 Hz band-pass filter was then applied to suppress physiological drift and high-frequency noise. Subsequently, hemodynamic analysis was performed using NirSpark v2.1 software, where changes in optical density were converted into relative concentration changes in HbO and HbR according to the modified Beer–Lambert law [24]. The preprocessed fNIRS data retained the original sampling rate of 11 Hz and were likewise segmented into 1 s epochs aligned with the EEG epochs.
2.1.4. Feature Extraction
After EEG preprocessing, the signals were decomposed into five canonical frequency bands: $\delta$ (0.5–4 Hz), $\theta$ (4–8 Hz), $\alpha$ (8–13 Hz), $\beta$ (13–30 Hz), and $\gamma$ (30–45 Hz). For each band, the differential entropy (DE) feature was computed [25], which can be expressed as
$$\mathrm{DE}(x) = -\int_{-\infty}^{\infty} p(x)\,\log p(x)\,dx = \frac{1}{2}\log\left(2\pi e\sigma^{2}\right),$$
where $x$ denotes the EEG signal amplitude of a given channel within a specific frequency band and time window, which is assumed to follow a Gaussian distribution $\mathcal{N}(\mu, \sigma^{2})$ with density $p(x)$. Therefore, for each EEG channel, a 5-dimensional feature vector was obtained, corresponding to the energy-related DE features of the five frequency bands.
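As a minimal sketch of the DE feature, the closed form for a Gaussian signal can be evaluated directly from the sample variance of one band-filtered window. The function name, the use of the population variance, and the example amplitudes are assumptions made for illustration; the band-pass filtering itself is taken as already done.

```python
import math

def differential_entropy(samples):
    """DE of one band-limited window under a Gaussian assumption."""
    n = len(samples)
    mean = sum(samples) / n
    # Population variance (an assumption; the text does not state the estimator).
    variance = sum((x - mean) ** 2 for x in samples) / n
    return 0.5 * math.log(2 * math.pi * math.e * variance)

# Hypothetical amplitudes; a real 1 s window at 200 Hz would hold 200 samples.
window = [0.5, -1.2, 0.3, 1.1, -0.7, 0.0]
de_value = differential_entropy(window)
```

Repeating this for the five bands of one channel would give the 5-dimensional node feature described above.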
The fNIRS signals primarily reflect the temporal evolution of cerebral hemodynamics. To capture the dynamic characteristics of HbO and HbR concentrations during affect elicitation and to keep the feature dimensionality consistent with the EEG representation for subsequent unified node processing, we extracted five statistical features (mean, variance, skewness, maximum, and minimum) from the HbO and HbR time series of each fNIRS channel. These features characterize both the statistical properties and variation trends of fNIRS signals and are defined as
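$$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i,\quad \sigma^{2} = \frac{1}{N}\sum_{i=1}^{N}\left(x_i-\mu\right)^{2},\quad s = \frac{1}{N}\sum_{i=1}^{N}\left(\frac{x_i-\mu}{\sigma}\right)^{3},\quad x_{\max} = \max_{1\le i\le N} x_i,\quad x_{\min} = \min_{1\le i\le N} x_i.$$
Here, $x_i$ denotes the i-th fNIRS sample, and $N$ is the sample length; the skewness $s$ reflects the asymmetry of the data distribution. Finally, each EEG electrode and each fNIRS channel are represented as a feature vector in $\mathbb{R}^{5}$.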
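The five fNIRS statistics can be sketched as follows for a single HbO or HbR window. The function name, the population (1/N) normalization, and the example values are assumptions; the paper's exact estimators may differ.

```python
import math

def fnirs_features(series):
    """Return [mean, variance, skewness, maximum, minimum] of one window."""
    n = len(series)
    mean = sum(series) / n
    variance = sum((x - mean) ** 2 for x in series) / n
    std = math.sqrt(variance)
    # Skewness is set to 0 for a constant window to avoid dividing by zero.
    skewness = 0.0 if std == 0 else sum(((x - mean) / std) ** 3 for x in series) / n
    return [mean, variance, skewness, max(series), min(series)]

# Hypothetical 1 s HbO window at 11 Hz (11 samples).
hbo_window = [0.1, 0.2, 0.4, 0.3, 0.2, 0.1, 0.0, 0.1, 0.3, 0.5, 0.4]
features = fnirs_features(hbo_window)
```

Each HbO and HbR channel then contributes one 5-dimensional vector, matching the EEG node dimensionality.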
2.1.5. Graph Construction
For EEG signals, we define the 62 EEG electrodes as a node set $V_E$ with $|V_E| = 62$. We compute the correlations among all EEG channel signals using the Pearson correlation coefficient (PCC) [26] in a subject-wise manner to avoid any training–test leakage. Specifically, for each subject, the PCC is estimated using only the trials from that subject, yielding a subject-specific adjacency matrix $A_E \in \mathbb{R}^{62 \times 62}$. The PCC measures the linear association between two EEG channel signals, i.e., it quantifies how consistently the amplitudes of two channel-wise feature vectors co-vary. The magnitude $|\rho|$ indicates the strength of the correlation: values closer to 1 correspond to stronger linear dependence, whereas $\rho = 0$ indicates that the two vectors are linearly uncorrelated. According to research by Koszut et al. [27], we then define connections whose $|\rho|$ exceeds a threshold as edges to retain only sufficiently correlated channel pairs and suppress weak/noisy couplings. In this way, an undirected EEG graph is constructed as $G_E = (V_E, E_E)$, together with the node-feature matrix $X_E \in \mathbb{R}^{4845 \times 62 \times 5}$ and the subject-specific adjacency matrix $A_E$, where each node feature corresponds to the DE feature computed above. Here, 4845 denotes the total number of samples obtained after segmenting the signals into 1 s windows. The PCC between two channel signals $x$ and $y$ is
$$\rho_{x,y} = \frac{\operatorname{cov}(x, y)}{\sigma_x \sigma_y}.$$
Here, $\operatorname{cov}(x, y)$ denotes the covariance, and $\sigma_x$ and $\sigma_y$ are the standard deviations of the two channel signals. According to Achard et al. [26], we retain, for each node, the top 20% strongest connections when building the graph.
For fNIRS signals, we perform graph-level fusion of HbO and HbR. Specifically, the 18-channel HbO signals and the 18-channel HbR signals are jointly treated as graph nodes, yielding a node set $V_F$ with $|V_F| = 36$. Following the same procedure as for EEG, we compute PCC-based correlations to construct an fNIRS feature matrix $X_F \in \mathbb{R}^{4845 \times 36 \times 5}$ and an adjacency matrix $A_F \in \mathbb{R}^{36 \times 36}$.
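The graph construction can be illustrated with a small sketch that computes absolute Pearson correlations between channels and keeps, for each node, its strongest 20% of connections. The helper names, the rounding of the 20% quota, and the symmetrization (an edge is kept if either endpoint selects it) are assumptions; the source does not spell out these details.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length signals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0

def build_adjacency(channels, keep_ratio=0.2):
    """Binary undirected adjacency keeping each node's strongest |rho| links."""
    n = len(channels)
    strength = [[abs(pearson(channels[i], channels[j])) if i != j else 0.0
                 for j in range(n)] for i in range(n)]
    k = max(1, math.ceil(keep_ratio * (n - 1)))  # per-node quota (assumed rounding)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        candidates = [j for j in range(n) if j != i]
        strongest = sorted(candidates, key=lambda j: strength[i][j], reverse=True)[:k]
        for j in strongest:
            adj[i][j] = adj[j][i] = 1  # symmetrize so the graph stays undirected
    return adj

# Hypothetical 4-channel example; the EEG graph would use 62 channels.
toy_channels = [[1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0],
                [4.0, 3.0, 2.0, 1.0], [1.0, -1.0, 1.0, -1.0]]
toy_adjacency = build_adjacency(toy_channels)
```

Computing the correlations per subject, as the text describes, amounts to calling `build_adjacency` once per subject on that subject's data only.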
Figure 2 presents the adjacency matrices of the constructed graphs for the two modalities. In summary, constructing modality-specific graphs enables subsequent GNNs to mine comprehensive cortical electrophysiological and neurovascular characteristics within a unified topological space.
3. Methods
3.1. Graph Isomorphism Network Based on Graph Attention
Among many GNN variants, the graph isomorphism network (GIN) has been theoretically shown to possess strong graph-discriminative capability [23]. Under mild assumptions (e.g., when node and edge attributes are not overly complex), the GIN can achieve expressiveness comparable to the Weisfeiler–Lehman (WL) graph isomorphism test [28], thereby effectively distinguishing different non-isomorphic graph structures. However, the standard GIN adopts an isotropic neighbor-aggregation scheme, implicitly assuming that all neighbors of a central node contribute equally.
In EEG–fNIRS emotion recognition, functional brain connectivity exhibits pronounced dynamics and heterogeneity: specific affective states tend to activate particular neural circuits rather than uniformly engaging the whole brain. Motivated by this observation and inspired by GAT [29], we propose an improved Attention-GIN (AGIN) module as the encoder. While preserving the strong topological discriminability of the GIN, the AGIN introduces an attention mechanism to enable anisotropic feature aggregation, adaptively focusing on informative neighboring nodes and aggregating salient features more effectively.
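Given a brain network graph $G = (V, E)$, let $h_i^{(l)}$ denote the feature vector of node $i$ at the l-th layer. The AGIN consists of two steps: attention-coefficient computation and weighted message passing. Specifically, the attention score between the central node $i$ and its neighbor $j \in \mathcal{N}(i)$ is first computed as
$$e_{ij}^{(l)} = \mathrm{LeakyReLU}\left(\mathbf{a}^{\top}\left[W h_i^{(l)} \,\Vert\, W h_j^{(l)}\right]\right), \qquad \alpha_{ij}^{(l)} = \frac{\exp\left(e_{ij}^{(l)}\right)}{\sum_{k \in \mathcal{N}(i)} \exp\left(e_{ik}^{(l)}\right)}.$$
To model interactions among nodes, we concatenate the transformed features $W h_i^{(l)}$ and $W h_j^{(l)}$ of the central node and its neighbor, where $W$ is a shared learnable weight matrix, and then compute the raw attention score $e_{ij}^{(l)}$ via a learnable linear transformation and LeakyReLU activation, where $\mathbf{a}$ is a learnable attention weight vector. Subsequently, a Softmax function is applied to normalize the attention scores over all neighbors, yielding the attention coefficient $\alpha_{ij}^{(l)}$, which adaptively reflects the importance of neighbor $j$ to the central node $i$. After obtaining the attention weights, we further modify the aggregation rule of the GIN. The standard GIN aggregation can be written as
$$h_i^{(l+1)} = \mathrm{MLP}^{(l)}\left(\left(1+\epsilon^{(l)}\right) h_i^{(l)} + \sum_{j \in \mathcal{N}(i)} h_j^{(l)}\right).$$
Here, $\epsilon^{(l)}$ is a learnable parameter (or a fixed constant) that controls the contribution of the central node’s own feature, and $\mathrm{MLP}^{(l)}$ denotes a multi-layer perceptron used to model nonlinear transformations. When aggregating neighborhood information, we weight each neighbor by its attention coefficient $\alpha_{ij}^{(l)}$, as illustrated in Figure 3, which shows the attention computation and the overall AGIN framework. After obtaining the attention coefficients between a central node and its neighbors, the next-layer node representation is computed via weighted summation followed by an MLP. The AGIN aggregation is defined as
$$h_i^{(l+1)} = \mathrm{MLP}^{(l)}\left(\left(1+\epsilon^{(l)}\right) h_i^{(l)} + \sum_{j \in \mathcal{N}(i)} \alpha_{ij}^{(l)} h_j^{(l)}\right).$$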
With this design, the AGIN can adaptively emphasize informative connections and suppress noisy links according to physiological-signal characteristics under affect elicitation, thereby strengthening information flow over key brain regions and producing more discriminative graph representations. Notably, the standard GAT performs attention-driven neighborhood aggregation, whereas the AGIN superimposes attention reweighting on top of the GIN-style MLP and sum aggregation, thereby preserving graph isomorphism expressive power. We adopt a two-layer AGIN encoder: the first layer maps the input from 5 to 16 dimensions, and the second layer maps the representation from 16 to 32 dimensions. The graph-level features obtained by aggregating node embeddings from these two layers are used for classification and subsequent processing. In addition, to alleviate the problem of vanishing or exploding gradients, we introduce residual connections. $h_G$ denotes the graph features output by the encoder.
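One AGIN layer, as described in this section, can be sketched in plain Python: GAT-style attention coefficients are computed over each node's neighbors and then used to weight the GIN sum before the MLP. This is an illustrative reading of the text, not the authors' implementation; the function names, the LeakyReLU slope of 0.2, and the choice to weight the untransformed neighbor features are assumptions, and residual connections are omitted.

```python
import math

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def attention_coefficients(h, neighbors, W, a):
    """Softmax-normalized attention of every node i over its neighbors."""
    wh = [matvec(W, v) for v in h]
    alpha = []
    for i, nbrs in enumerate(neighbors):
        if not nbrs:
            alpha.append({})
            continue
        # wh[i] + wh[j] is list concatenation, i.e. [W h_i || W h_j].
        scores = [leaky_relu(sum(ak * x for ak, x in zip(a, wh[i] + wh[j])))
                  for j in nbrs]
        top = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha.append({j: e / total for j, e in zip(nbrs, exps)})
    return alpha

def agin_layer(h, neighbors, W, a, mlp, eps=0.0):
    """MLP((1 + eps) * h_i + sum_j alpha_ij * h_j) for every node i."""
    alpha = attention_coefficients(h, neighbors, W, a)
    out = []
    for i, v in enumerate(h):
        agg = [(1.0 + eps) * x for x in v]
        for j in neighbors[i]:
            agg = [s + alpha[i][j] * x for s, x in zip(agg, h[j])]
        out.append(mlp(agg))
    return out

# Hypothetical 3-node graph with 2-dimensional features and a ReLU stand-in MLP.
h0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
nbrs = [[1, 2], [0, 2], [0, 1]]
identity = [[1.0, 0.0], [0.0, 1.0]]
h1 = agin_layer(h0, nbrs, identity, [1.0, -1.0, 0.5, 2.0],
                mlp=lambda v: [max(0.0, x) for x in v])
```

Setting every attention coefficient to 1 recovers the plain GIN sum, which is one way to see that the attention acts purely as a reweighting of the neighbor term.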
3.2. EEG-fNIRS Cross-Modal Contrastive Learning
Although EEG and fNIRS signals are recorded simultaneously to capture neural activity during the same affect elicitation process, they exhibit pronounced heterogeneity: EEG reflects millisecond-level high-frequency electrophysiological fluctuations, whereas fNIRS manifests second-level low-frequency hemodynamic variations. Such substantial differences in physical attributes lead to feature distributions that reside on largely non-overlapping manifolds, making it difficult for simple concatenation to effectively exploit cross-modal complementarity.
Inspired by the success of CLIP [22] in vision–language representation learning, we transfer the contrastive learning paradigm to multi-modal brain-signal analysis. Similar to CLIP, which aligns image and text features via a dual-encoder framework, we design an EEG–fNIRS cross-modal contrastive learning module. Specifically, we employ the proposed AGIN as modality-specific encoders for EEG and fNIRS, respectively, and use contrastive learning to align the two heterogeneous physiological signals by projecting them into a unified latent semantic space.
First, the two constructed graphs $G_E$ and $G_F$ are fed into two non-shared encoders, $f_E$ and $f_F$, each implemented by a stacked AGIN. For the i-th sample in a mini-batch containing N samples, the resulting high-dimensional graph-level representations are
$$h_i^{e} = f_E\left(G_{E,i}\right), \qquad h_i^{f} = f_F\left(G_{F,i}\right),$$
where $h_i^{e}, h_i^{f} \in \mathbb{R}^{d}$ and $d$ denotes the dimensionality of the encoder output.
Prior studies [20] have shown that directly applying contrastive loss on the encoder outputs may discard information that is beneficial for downstream classification. Therefore, we introduce an independent nonlinear projection head after each encoder to map the high-dimensional representation onto a lower-dimensional hypersphere for similarity computation. The projection head is designed as a single-hidden-layer multi-layer perceptron (MLP), consisting of a fully connected layer, a ReLU activation, and a linear layer, followed by normalization. Taking the EEG branch as an example, the projected feature is computed as
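$$z_i^{e} = \mathrm{Norm}\left(W_2\,\mathrm{ReLU}\left(W_1 h_i^{e} + b_1\right) + b_2\right),$$
where $h_i^{e}$ is the EEG encoder output, $\mathrm{Norm}(\cdot)$ is the normalization step, and $W_1, W_2$ and $b_1, b_2$ denote the weight matrices and bias vectors of the MLP. The input and output dimensions of $W_1$ are 32 and 32, respectively, while those of $W_2$ are 32 and 128; together, they constitute a two-layer fully connected network. The fNIRS branch is computed in the same manner to obtain $z_i^{f}$. The resulting vectors are then used to compute the contrastive loss and the corresponding pull–push distances.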
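To achieve feature alignment, we introduce a symmetric cross-modal contrastive loss in the feature-fusion stage, following the InfoNCE [30] formulation. The loss consists of two parts: (i) using EEG as the anchor to retrieve the corresponding fNIRS sample, denoted by $\mathcal{L}_{e \to f}$, and (ii) using fNIRS as the anchor to retrieve the corresponding EEG sample, denoted by $\mathcal{L}_{f \to e}$. For a mini-batch of size N, let $z_i^{e}$ and $z_i^{f}$ be the $\ell_2$-normalized projected features of the i-th sample from EEG and fNIRS, respectively. We define positive pairs as matched EEG–fNIRS views from the same sample, i.e., $(z_i^{e}, z_i^{f})$, while the remaining mismatched pairs $(z_i^{e}, z_j^{f})$ with $j \neq i$ are treated as negatives. With this construction, the model is encouraged to identify correct cross-modal correspondences across all samples, thereby capturing shared affect elicitation patterns between the two modalities. The loss is defined as follows:
$$\mathcal{L}_{e \to f} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(z_i^{e}, z_i^{f})/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(z_i^{e}, z_j^{f})/\tau\right)}, \qquad \mathcal{L}_{f \to e} = -\frac{1}{N}\sum_{i=1}^{N} \log \frac{\exp\left(\mathrm{sim}(z_i^{f}, z_i^{e})/\tau\right)}{\sum_{j=1}^{N} \exp\left(\mathrm{sim}(z_i^{f}, z_j^{e})/\tau\right)},$$
$$\mathcal{L}_{cm} = \frac{1}{2}\left(\mathcal{L}_{e \to f} + \mathcal{L}_{f \to e}\right).$$
Here, $\mathrm{sim}(\cdot,\cdot)$ denotes the cosine similarity, and $\tau$ is the temperature parameter (set to 0.1 in this paper) used to adjust the model’s discrimination sharpness over difficult samples. Specifically, minimizing $\mathcal{L}_{e \to f}$ maximizes the cosine similarity between each EEG sample and its corresponding fNIRS sample relative to the mismatched fNIRS samples; symmetrically, minimizing $\mathcal{L}_{f \to e}$ maximizes the cosine similarity between each fNIRS sample and its corresponding EEG sample. The final cross-modal contrastive loss $\mathcal{L}_{cm}$ is the average of these two directional losses, and the overall cross-modal contrastive learning pipeline is illustrated in Figure 4.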
In Figure 4, the red boxes represent the two paired modal samples. By jointly optimizing the two directions and minimizing $\mathcal{L}_{cm}$, the model ensures that regardless of which modality is used as the query, the corresponding feature in the other modality can be accurately retrieved within the shared semantic space. This yields more robust semantic alignment and enables the model to learn affect-relevant structures shared across modalities (e.g., consistent affect elicitation patterns), thereby alleviating the distribution mismatch induced by heterogeneous data.
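The symmetric cross-modal objective can be sketched as two InfoNCE terms, one per retrieval direction, averaged together. The function names are hypothetical, the inputs stand in for projected EEG and fNIRS features, and the default temperature of 0.1 follows the value quoted in the text; a practical implementation would operate on batched tensors instead of Python lists.

```python
import math

def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def directional_info_nce(anchors, targets, tau):
    """Mean of -log softmax over targets, with the matched index as the positive."""
    n = len(anchors)
    loss = 0.0
    for i in range(n):
        logits = [sum(a * b for a, b in zip(anchors[i], targets[j])) / tau
                  for j in range(n)]
        top = max(logits)  # log-sum-exp with max subtraction for stability
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        loss += log_denominator - logits[i]
    return loss / n

def cross_modal_loss(z_eeg, z_fnirs, tau=0.1):
    """Average of the EEG-to-fNIRS and fNIRS-to-EEG InfoNCE terms."""
    ze = [l2_normalize(v) for v in z_eeg]
    zf = [l2_normalize(v) for v in z_fnirs]
    return 0.5 * (directional_info_nce(ze, zf, tau)
                  + directional_info_nce(zf, ze, tau))

# Hypothetical 2-sample batch: matched pairs point in the same direction.
aligned = cross_modal_loss([[1.0, 0.0], [0.0, 1.0]], [[2.0, 0.0], [0.0, 3.0]])
# The same batch with the fNIRS rows swapped, so every pair is mismatched.
misaligned = cross_modal_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 3.0], [2.0, 0.0]])
```

In this toy batch the aligned loss is close to zero while the misaligned loss is large, which is the behavior the alignment objective relies on.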
After alignment, the two modality-specific vectors become close to each other in the shared semantic space and tend to point in similar directions (indicating consistent affective semantics). We then fuse them by simple averaging to obtain the cross-modal representation $z_i$, which is subsequently $\ell_2$-normalized and fed into downstream classification. The fusion is defined as
$$\tilde{z}_i = \frac{1}{2}\left(z_i^{e} + z_i^{f}\right), \qquad z_i = \frac{\tilde{z}_i}{\lVert \tilde{z}_i \rVert_2}.$$
3.3. Cross-Subject Supervised Contrastive Learning
Traditional supervised learning methods (e.g., cross-entropy loss $\mathcal{L}_{ce}$) mainly focus on learning decision boundaries that separate emotion classes on the training set. However, in the leave-one-subject-out (LOSO) setting, such approaches often cause the model to overfit subject-specific physiological patterns rather than learning truly subject-invariant affective characteristics. When encountering an unseen test subject, these idiosyncratic patterns no longer hold, resulting in a substantial performance drop.
To address this generalization bottleneck, we propose a cross-subject supervised contrastive learning strategy. The key idea is to explicitly perform “feature disentanglement” in the representation space: suppress subject-identity information while enhancing emotion-discriminative information.
In standard supervised contrastive learning (SupCon) [21], samples sharing the same emotion label are treated as positive pairs. This is insufficient for cross-subject tasks because it may simply pull together samples from the same subject. Therefore, we define a cross-subject positive-pair construction. Consider a training mini-batch containing N fused samples. For an anchor sample i, with representation, emotion label, and subject identity denoted by $z_i$, $y_i$, and $s_i$, respectively, the positive set is defined by requiring the same emotion label ($y_p = y_i$) but a different subject identity ($s_p \neq s_i$):
$$P(i) = \left\{\, p \in \{1,\dots,N\} \setminus \{i\} : y_p = y_i,\ s_p \neq s_i \,\right\}. \tag{14}$$
Two samples are then drawn from $P(i)$ to form the positive set for anchor i. Correspondingly, all samples in the mini-batch that do not satisfy the above conditions (i.e., samples with different emotion labels) are paired with the anchor to construct the negative set. It is worth noting that we do not use the sample itself to form a positive pair. This is because self-pairs do not provide additional cross-subject information for the same emotion class; excluding them explicitly blocks the shortcut of leveraging subject-identity cues to reduce contrastive distances.
Based on the above positive/negative sample construction, the cross-subject supervised contrastive loss is defined as
$$\mathcal{L}_{cs} = \frac{1}{C}\sum_{i:\,|P(i)|>0} \ell_i, \qquad \ell_i = -\frac{1}{|P(i)|}\sum_{p \in P(i)} \log \frac{\exp\left(z_i^{\top} z_p/\tau_{cs}\right)}{\sum_{a \in A(i)} \exp\left(z_i^{\top} z_a/\tau_{cs}\right)}. \tag{15}$$
Here, $A(i)$ denotes the set of all samples in the batch except anchor i, $|P(i)|$ is the number of positive samples, and $C$ is the number of anchors with a non-empty positive set. The parameter $\tau_{cs}$ is the cross-subject temperature coefficient (set to 0.05 in this paper). By minimizing $\mathcal{L}_{cs}$, the model pulls together samples from different subjects but with the same emotion label, encouraging their feature distributions to overlap while pushing apart samples from different emotion classes. This yields representations that are more robust to inter-subject variability and substantially improves generalization to unseen subjects. Algorithm 1 provides the pseudocode for computing the cross-subject contrastive loss.
Algorithm 1. Cross-subject supervised contrastive learning
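- Input: A mini-batch of fused features $\{z_i\}_{i=1}^{N}$; emotion labels $\{y_i\}_{i=1}^{N}$; subject IDs $\{s_i\}_{i=1}^{N}$; temperature $\tau_{cs}$
- Output: Cross-subject supervised contrastive loss $\mathcal{L}_{cs}$ (Equation (15))
- 1: Normalize features: $z_i \leftarrow z_i / \lVert z_i \rVert_2$
- 2: Compute similarity matrix: $S_{ij} = z_i^{\top} z_j / \tau_{cs}$
- 3: Mask self-pairs: set $S_{ii} = -\infty$
- 4: $L \leftarrow 0$, $C \leftarrow 0$
- 5: for $i = 1$ to N do
- 6:   Construct cross-subject positive set $P(i)$ according to Equation (14)
- 7:   Construct all-others set $A(i) = \{1,\dots,N\} \setminus \{i\}$
- 8:   if $|P(i)| > 0$ then
- 9:     Compute anchor loss $\ell_i$ according to Equation (15) using $P(i)$ and $A(i)$
- 10:     $L \leftarrow L + \ell_i$
- 11:     $C \leftarrow C + 1$
- 12:   end if
- 13: end for
- 14: $\mathcal{L}_{cs} \leftarrow L / \max(C, 1)$
- 15: return $\mathcal{L}_{cs}$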
Note that C counts the number of valid anchors that admit at least one cross-subject positive. In rare cases, C can be 0 when a mini-batch contains no emotion class represented by at least two different subjects (e.g., all samples of a class come from a single subject, or some classes appear only once in the batch). We, therefore, use $\max(C, 1)$ to avoid division by zero; when $C = 0$, the cross-subject contrastive loss is effectively set to zero for that batch, and only the remaining objectives contribute to the update.
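The steps of Algorithm 1 can be sketched in plain Python as follows. This is an illustrative version under stated assumptions: the function name is hypothetical, every cross-subject positive in the batch is used (no subsampling of positives), and self-pairs are excluded by construction instead of by masking a similarity matrix.

```python
import math

def cross_subject_supcon(features, labels, subjects, tau=0.05):
    """Supervised contrastive loss whose positives share the emotion label
    of the anchor but come from a different subject."""
    n = len(features)
    z = []
    for v in features:  # step 1: L2-normalize each feature vector
        norm = math.sqrt(sum(x * x for x in v)) or 1.0
        z.append([x / norm for x in v])
    total, valid_anchors = 0.0, 0
    for i in range(n):
        positives = [p for p in range(n)
                     if p != i and labels[p] == labels[i] and subjects[p] != subjects[i]]
        if not positives:
            continue  # anchor without a cross-subject positive is skipped
        others = [a for a in range(n) if a != i]  # all samples except the anchor
        sims = {a: sum(x * y for x, y in zip(z[i], z[a])) / tau for a in others}
        top = max(sims.values())
        log_denominator = top + math.log(sum(math.exp(s - top) for s in sims.values()))
        total += -sum(sims[p] - log_denominator for p in positives) / len(positives)
        valid_anchors += 1
    return total / max(valid_anchors, 1)  # 0.0 when no anchor is valid

# Hypothetical batch: emotion classes are well separated across two subjects.
feats = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
good = cross_subject_supcon(feats, labels=[0, 0, 1, 1], subjects=[1, 2, 1, 2])
# Same features, but the labels now disagree with the geometry.
bad = cross_subject_supcon(feats, labels=[0, 1, 0, 1], subjects=[1, 1, 2, 2])
```

When features of the same emotion from different subjects coincide, the loss is near zero; when same-emotion features from different subjects are far apart, it grows, which is the pressure toward subject-invariant representations described above.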
3.4. Overall Model
To translate the extracted high-level semantic features into the final emotion predictions, we attach a linear classifier after the feature-fusion module. Specifically, the input is the $\ell_2$-normalized fused multi-modal feature vector, and the classifier is implemented by a fully connected layer. The cross-entropy classification loss is defined as
$$\mathcal{L}_{ce} = -\frac{1}{N}\sum_{n=1}^{N}\sum_{i=1}^{C} y_{n,i}\,\log \hat{y}_{n,i},$$
where N is the batch size, C is the number of emotion classes, $y_{n,i}$ denotes the one-hot ground-truth label of the n-th sample with $y_{n,i} \in \{0, 1\}$, and $\hat{y}_{n,i}$ is the predicted probability for class i.
As illustrated in Figure 5, the overall objective combines the above losses into a multi-task joint optimization target. The total loss is formulated as a weighted sum of three terms:
$$\mathcal{L}_{total} = \mathcal{L}_{ce} + \lambda_1 \mathcal{L}_{cm} + \lambda_2 \mathcal{L}_{cs}.$$
Here, $\mathcal{L}_{ce}$ ensures discriminative representations for emotion classification, $\mathcal{L}_{cm}$ encourages consistent semantic representations between the EEG and fNIRS encoders to alleviate cross-modal heterogeneity, and $\mathcal{L}_{cs}$ suppresses subject-specific variations to learn subject-invariant affective representations, thereby improving cross-subject generalization. The hyperparameters $\lambda_1$ and $\lambda_2$ balance the contribution of the contrastive objectives against the primary classification objective. By minimizing $\mathcal{L}_{total}$, the model can be trained end-to-end to jointly optimize representation learning and classification performance.
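A short sketch of how the three terms combine is given below. The weight values are placeholders chosen for illustration only, since this section does not report the weights used; the function and parameter names are likewise hypothetical.

```python
import math

def cross_entropy(probabilities, targets):
    """Mean negative log-probability of the true class."""
    return -sum(math.log(p[t]) for p, t in zip(probabilities, targets)) / len(targets)

def total_loss(loss_ce, loss_cm, loss_cs, lambda_cm=0.5, lambda_cs=0.5):
    """Classification loss plus weighted contrastive terms (weights are placeholders)."""
    return loss_ce + lambda_cm * loss_cm + lambda_cs * loss_cs

# Hypothetical 4-class predictions for two samples with true classes 0 and 3.
probs = [[0.7, 0.1, 0.1, 0.1], [0.2, 0.2, 0.2, 0.4]]
ce = cross_entropy(probs, [0, 3])
combined = total_loss(ce, loss_cm=0.8, loss_cs=1.2)
```

In the subject-dependent experiments, where the cross-subject term is removed, the same function applies with `lambda_cs=0.0`.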
4. Experimental Results and Analysis
4.1. Experimental Setup
To evaluate the effectiveness of the proposed method, experiments were conducted on our self-collected EEG–fNIRS emotion dataset TYUT3.0. Data from 30 participants across four emotions (sadness, happiness, neutral, and fear) were used for training, validation, and testing. Existing multi-modal EEG-based emotion recognition datasets [31] typically conduct validation with approximately 20–50 subjects; therefore, the sample size used in this work falls within the commonly adopted range in this field and is comparable to prior research. To rigorously assess emotion recognition performance under different experimental settings, we adopted two evaluation protocols:
- 1. Subject-dependent setting: Samples from all subjects were pooled, randomly shuffled, and evenly split into 5 folds for 5-fold cross-validation. In each round, one fold was used as the test set. From the remaining four folds, one fold was further designated as the validation set, and the other three folds were used for training. The best hyperparameters were selected on the validation set, and the model checkpoint with the best validation performance was saved. The test set result was then taken as the performance for that round. Finally, we report the mean accuracy (Acc) and standard deviation over the five rounds to evaluate the effectiveness of the encoder and cross-modal contrastive learning.
- 2. Subject-independent setting: We employed the leave-one-subject-out (LOSO) protocol. Since the dataset contains 30 subjects, the subject-independent experiments were conducted over 30 rounds. In each round, the data from one subject were held out as an independent test set, while the data from the remaining 29 subjects were used for model training. Within these 29 training subjects, we also performed 5-fold cross-validation at the subject level to construct an inner validation split, where four folds were used for training and the remaining fold was used as the validation set for best-model selection. Z-score normalization was computed using statistics from the inner training split only, so that no information from the validation or test sets leaked into normalization. The reported performance is the mean accuracy and standard deviation over the 30 test rounds, which is used to verify the effectiveness of cross-subject contrastive learning.
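The nested split of the subject-independent protocol can be sketched as follows. This is a minimal illustration rather than the authors' code: the function names (`loso_splits`, `fit_zscore`, `apply_zscore`), the fixed seed, and the way subjects are dealt into inner folds are assumptions. Only the structure comes from the description above: one held-out subject, a subject-level inner 5-fold split, and normalization statistics taken from the inner training split alone.

```python
import random
import statistics


def loso_splits(subject_ids, n_inner_folds=5, seed=0):
    """Yield (test_subject, inner_train_subjects, inner_val_subjects) triples."""
    subjects = sorted(set(subject_ids))
    rng = random.Random(seed)
    for test_subject in subjects:
        remaining = [s for s in subjects if s != test_subject]
        rng.shuffle(remaining)
        # Deal the remaining subjects into inner folds at the subject level.
        folds = [remaining[i::n_inner_folds] for i in range(n_inner_folds)]
        for k, val_subjects in enumerate(folds):
            train_subjects = [s for j, fold in enumerate(folds) if j != k for s in fold]
            yield test_subject, train_subjects, val_subjects


def fit_zscore(rows):
    """Per-feature mean and standard deviation, computed on training rows only."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Constant features get a unit scale to avoid division by zero.
    stds = [statistics.pstdev(col) or 1.0 for col in columns]
    return means, stds


def apply_zscore(rows, means, stds):
    """Normalize any split (train, validation, or test) with the training statistics."""
    return [[(x - m) / s for x, m, s in zip(row, means, stds)] for row in rows]
```

With 30 subjects and 5 inner folds this yields 150 (outer, inner) combinations; the same `means` and `stds` fitted on the inner training rows are reused for the validation and held-out test rows.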
In addition, to assess the statistical significance of the performance improvements, we conducted paired-sample t-tests on the experimental results and reported the 95% confidence intervals (CIs). A p-value smaller than 0.05 was considered statistically significant. All experiments were conducted on an NVIDIA Tesla T4 GPU using the PyTorch 2.0 deep learning framework. The software environment included Python 3.11 and CUDA 11.8, and PyTorch Geometric was used to implement the GIN-based models. We optimized the model parameters using the cross-entropy loss and the Adam optimizer with a learning rate of 0.001. The batch size was 64, and the maximum number of training epochs was 100. An early-stopping strategy was adopted, whereby training was terminated if the validation accuracy did not improve for 20 consecutive epochs.
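The early-stopping rule described above can be illustrated with a small, framework-independent sketch. The callback `evaluate_epoch` is hypothetical: it stands in for "train one epoch, then return the validation accuracy" and is not part of the paper. The defaults mirror the reported settings (at most 100 epochs, patience of 20).

```python
def run_with_early_stopping(evaluate_epoch, max_epochs=100, patience=20):
    """Run epochs until validation accuracy stalls for `patience` epochs.

    evaluate_epoch(epoch) is a hypothetical callback returning validation accuracy.
    Returns (best_epoch, best_accuracy, epochs_run).
    """
    best_acc = float("-inf")
    best_epoch = -1
    epochs_since_improvement = 0
    epochs_run = 0
    for epoch in range(max_epochs):
        acc = evaluate_epoch(epoch)
        epochs_run += 1
        if acc > best_acc:
            # A real training loop would also save the model checkpoint here.
            best_acc, best_epoch = acc, epoch
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
            if epochs_since_improvement >= patience:
                break
    return best_epoch, best_acc, epochs_run
```

Returning the best epoch alongside the accuracy makes it easy to reload the matching checkpoint before evaluating on the test fold.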
4.2. Results
4.2.1. EEG-fNIRS Fusion Emotion Recognition Comparison Experiment
To demonstrate the superiority of the proposed model, we compared it with several baseline methods, including the conventional machine-learning approach SVM [32] and representative GNN models such as GCN [33], GraphSAGE [34], GAT [29], and GIN [23]. We first considered the subject-dependent setting. In this case, we removed the final cross-subject contrastive learning term and kept only the AGIN encoder and the cross-modal contrastive learning module. The detailed five-fold cross-validation results are summarized in Table 1.
As shown in Table 1, our method achieves a notably high recognition accuracy on the four-class task (Acc = 96.98%), outperforming the vanilla GIN baseline by approximately 2.41%. Based on the distribution of the collected results and the sample size, we additionally report the 95% CI of the accuracy for each method. Moreover, a paired t-test between DC-AGIN and the baseline GIN yields a p-value below the 0.05 significance threshold, which indicates that the proposed improvement is statistically significant under the subject-dependent setting. In addition, among the compared GNN variants, the GIN provides the strongest baseline performance, supporting our choice of a GIN as the backbone for further improvements. These results indicate that when the distribution gap between training and test data is relatively small, the proposed DC-AGIN encoder can effectively capture fine-grained affective patterns from both EEG and fNIRS signals and align cross-modal representations, thereby improving subject-dependent emotion recognition performance.
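For readers who want to reproduce this kind of comparison, the paired t-statistic and the confidence interval of the mean accuracy difference can be computed as in the sketch below. It is an illustration under stated assumptions, not the authors' analysis script: the function name `paired_t_test` is made up, and the two-sided critical value must be supplied by the caller because the Python standard library has no t distribution (2.776 for 4 degrees of freedom, i.e., five folds; 2.045 for 29 degrees of freedom, i.e., thirty LOSO rounds).

```python
import math
import statistics


def paired_t_test(scores_a, scores_b, t_critical):
    """Paired t-statistic and confidence interval for mean(scores_a - scores_b).

    t_critical is the two-sided critical value for len(scores_a) - 1 degrees
    of freedom at the desired confidence level (supplied by the caller).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    mean_diff = statistics.fmean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)  # sample std, n - 1 in the denominator
    t_stat = mean_diff / std_err
    ci = (mean_diff - t_critical * std_err, mean_diff + t_critical * std_err)
    return t_stat, ci


# Hypothetical per-fold accuracies, for illustration only (not values from the paper).
model_acc = [3.0, 4.0, 5.0, 6.0, 7.0]
baseline_acc = [1.0, 3.0, 3.0, 5.0, 5.0]
t_stat, ci = paired_t_test(model_acc, baseline_acc, t_critical=2.776)
```

A confidence interval that excludes zero corresponds to a significant difference at the matching level.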
To evaluate the model’s adaptability to new users in practical brain–computer interface (BCI) applications, we re-introduced the cross-subject supervised contrastive learning module and conducted LOSO experiments using the complete DC-AGIN model. Under this setting, the test set consists entirely of data from a new subject who is not included in training, leading to a pronounced inter-subject discrepancy between the training and test sets. The subject-independent LOSO results are reported in Table 2.
As shown in Table 2, DC-AGIN maintains stable performance in the cross-subject setting. However, compared with subject-dependent five-fold cross-validation, the accuracy drops from 96.98% to 62.56%, indicating that cross-subject recognition is substantially more challenging than the subject-dependent setting. This gap mainly arises because, in the subject-dependent protocol, the training and test samples come from the same pool of subjects, allowing the model to exploit subject-specific physiological response patterns and thus achieve seemingly higher classification accuracy. In contrast, under the LOSO setting, the distribution shift of unseen subjects makes identity-related representations difficult to transfer directly, and the model may mistakenly treat inter-subject differences as emotion-related cues, leading to a pronounced degradation in generalization performance. Under LOSO evaluation, our model still substantially outperforms conventional methods and improves upon the vanilla GIN baseline by approximately 10.44%. This gain primarily stems from the proposed cross-subject supervised contrastive learning module, which suppresses subject-specific identity noise and encourages the encoder to learn more generalizable affective patterns, thereby generalizing better to unseen subjects. Under the LOSO protocol with 30 held-out subjects, we performed a paired t-test between DC-AGIN and the baseline GIN and obtained a p-value below the 0.05 significance threshold, which likewise indicates that the proposed improvement is statistically significant in the subject-independent setting.
In addition, we compared our model with recent state-of-the-art techniques. Most existing cross-subject emotion recognition studies focus on single-modality EEG and are evaluated on public datasets. To ensure a fair comparison, we therefore follow a minimal-adaptation principle when implementing the reproduced baselines. Specifically, we only unify the input interfaces (i.e., multi-modal channel numbers and feature dimensions), fusion strategy, and classification head so that each method can process the EEG-fNIRS multi-modal data in our study, without altering its core architecture or loss formulation. All baselines and DC-AGIN use exactly the same data splits and LOSO evaluation protocol, and hyperparameters are tuned within an identical search space. For each method, the final performance is obtained by selecting the best hyperparameters based on the validation set within the training fold and then testing on the held-out subject. The comparative results are reported in Table 3.
Here, ref. [13] is an EEG–fNIRS framework based on a GCN and capsule networks, and [35] performs multi-modal emotion recognition by extracting causal features. However, most of these methods are designed around feature-network engineering and do not specifically optimize for cross-subject emotion recognition. In the LOSO evaluation, our DC-AGIN achieves an average accuracy of 62.56%, outperforming all compared methods. Although recent studies [36,37,38] have attempted to incorporate transfer or contrastive strategies and have achieved performance gains, DC-AGIN remains competitive under LOSO validation.
Compared with the domain-adversarial model DANN [36], which minimizes the distribution discrepancy between source- and target-domain data via adversarial learning, DC-AGIN improves accuracy by approximately 3.75%. Compared with the best-performing spatiotemporal representation learning model, CLISA [37], DC-AGIN achieves an additional gain of about 1.78%. Moreover, paired-sample t-tests indicate that all comparisons yield p-values below the 0.05 significance threshold, confirming that the performance advantages of DC-AGIN are statistically significant.
These results further suggest that the proposed cross-modal contrastive learning is markedly more effective than other multi-modal fusion strategies. Although several competing methods are carefully designed to capture subject-specific characteristics, most of them are not explicitly optimized for cross-subject settings; consequently, under distribution shifts induced by unseen subjects, they tend to overfit the source-domain data and exhibit limited generalization. In addition, under the availability of explicit emotion labels and subject identities, the proposed cross-subject contrastive learning provides stronger cross-subject emotion recognition performance than domain adaptation approaches. Together with the alignment of EEG and fNIRS representations, this design plays a critical role in multi-modal cross-subject emotion recognition by enabling truly subject-invariant affective feature extraction, ultimately leading to improved recognition performance.
4.2.2. Ablation Experiment
To quantify the contribution of each key component in the DC-AGIN framework to cross-subject emotion recognition, we conducted a series of ablation studies under the LOSO protocol. The baseline model uses the vanilla GIN as the encoder, applies simple concatenation for multi-modal fusion, and is trained with the standard cross-entropy loss. AGIN replaces the GIN encoder with the attention-enhanced AGIN encoder only. CM introduces only the cross-modal contrastive learning objective for feature alignment. CS introduces only the cross-subject supervised contrastive learning objective. DC-AGIN is the complete dual-contrastive AGIN framework that jointly optimizes the three losses. Table 4 reports the LOSO results of all variants.
Compared with the baseline (53.12%), the DC-AGIN (AGIN) variant, which introduces only the attention mechanism, improves the accuracy to 54.26%. This indicates that relative to the isotropic mean aggregation in the standard GIN, AGIN can adaptively assign higher weights to key brain regions for affective recognition, thereby filtering out task-irrelevant background noise to some extent and extracting more discriminative graph features.
By incorporating the cross-modal contrastive loss, DC-AGIN (CM) further increases the accuracy to 56.62%, yielding a 3.50% improvement over the baseline. This gain suggests that due to the heterogeneous physical characteristics of EEG and fNIRS signals, direct fusion alone is unlikely to be optimal. Cross-modal contrastive learning explicitly pulls temporally paired heterogeneous signals closer in the latent space, enabling semantic alignment and enhancing cross-modal complementarity.
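The idea of pulling temporally paired EEG and fNIRS embeddings together can be sketched as an InfoNCE-style loss. The form below is an assumption, not the paper's exact formulation: it uses cosine similarity, treats the other samples in the batch as negatives, and averages the EEG-to-fNIRS and fNIRS-to-EEG directions. It is written in plain Python for self-containment; a practical implementation would operate on PyTorch tensors. The default temperature of 0.1 matches the value reported for the cross-modal objective.

```python
import math


def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cross_modal_info_nce(eeg_emb, fnirs_emb, temperature=0.1):
    """Symmetric InfoNCE: the i-th EEG and i-th fNIRS embeddings form a positive pair."""
    eeg = [_normalize(v) for v in eeg_emb]
    fnirs = [_normalize(v) for v in fnirs_emb]
    n = len(eeg)
    logits = [[_dot(e, f) / temperature for f in fnirs] for e in eeg]

    def directional_loss(matrix):
        total = 0.0
        for i, row in enumerate(matrix):
            peak = max(row)  # subtract the max for a numerically stable log-sum-exp
            log_denominator = peak + math.log(sum(math.exp(x - peak) for x in row))
            total += log_denominator - row[i]
        return total / n

    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (directional_loss(logits) + directional_loss(transposed))
```

The loss is close to zero when each EEG embedding is most similar to its own paired fNIRS embedding and grows when the pairing is broken.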
Notably, introducing only the cross-subject supervised contrastive loss leads DC-AGIN (CS) to achieve an accuracy of 60.04%, corresponding to a 6.92% improvement, the largest gain among all single-module variants. This aligns with the key challenge in the subject-independent setting: strong inter-subject distribution shifts. Without explicitly suppressing identity-related cues, the model may overfit subject-specific patterns rather than emotion-relevant features; CS enforces subject-invariant representations by treating same-emotion, different-subject samples as positives, thus improving generalization the most. By contrast, CM mainly reduces the representation gap between EEG and fNIRS, improving cross-modal alignment and training stability, so its standalone LOSO gain is smaller, but it strengthens robustness when combined with CS. Importantly, the gains from CM and CS are not additive: their effects partially overlap and the coupled multi-loss constraints show diminishing returns, so the final improvement is smaller than the sum of the individual gains.
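The positive-pair rule for the cross-subject objective (same emotion, different subject) can be made concrete with a supervised-contrastive sketch. Several details here are assumptions rather than the paper's definition: cosine similarity, a denominator that includes every other sample in the batch, and skipping anchors that have no valid positive. The default temperature of 0.05 matches the value reported for this objective, and plain Python is used only to keep the example dependency-free.

```python
import math


def cross_subject_supcon(embeddings, emotions, subjects, temperature=0.05):
    """Supervised contrastive loss with same-emotion, different-subject positives."""
    def normalize(vec):
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        return [x / norm for x in vec]

    z = [normalize(v) for v in embeddings]
    n = len(z)
    total, n_anchors = 0.0, 0
    for i in range(n):
        positives = [
            j for j in range(n)
            if j != i and emotions[j] == emotions[i] and subjects[j] != subjects[i]
        ]
        if not positives:
            continue  # this anchor has no cross-subject positive in the batch
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in range(n) if j != i
        }
        peak = max(logits.values())
        log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits.values()))
        total += sum(log_denominator - logits[j] for j in positives) / len(positives)
        n_anchors += 1
    return total / n_anchors if n_anchors else 0.0
```

Embeddings that cluster by emotion across subjects give a small loss, whereas embeddings that cluster by subject identity are penalized heavily.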
Finally, the full DC-AGIN model, which integrates all components, achieves the best overall performance, reaching an average accuracy of 62.56% with a standard deviation reduced to 8.57 (the lowest among all variants). Moreover, paired t-tests between DC-AGIN and each ablated variant all yield p-values below the 0.05 significance threshold, indicating that DC-AGIN significantly outperforms the ablated models. These results demonstrate that attention-based aggregation, cross-modal alignment, and cross-subject disentanglement are complementary rather than mutually exclusive. The complete framework not only improves recognition accuracy but also substantially enhances robustness when generalizing to unseen subjects. In addition, we recorded the computation time of the LOSO experiments for both the baseline and DC-AGIN. The total training time for each method was approximately 15 h, indicating a relatively high training cost, while the per-sample inference latency on the GPU was about 2.8 ms. We expect future work to further reduce the computational complexity; nonetheless, these results suggest that DC-AGIN improves emotion recognition performance without a substantial increase in computation time.
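The non-additivity of the individual gains can be checked directly from the accuracies quoted in this subsection. The short script below only restates those reported numbers; the variable names are illustrative.

```python
# LOSO accuracies (%) as quoted in the ablation discussion.
reported = {
    "Baseline": 53.12,
    "AGIN": 54.26,
    "CM": 56.62,
    "CS": 60.04,
    "DC-AGIN": 62.56,
}

# Gain of each variant over the baseline, in percentage points.
gains = {
    name: round(acc - reported["Baseline"], 2)
    for name, acc in reported.items()
    if name != "Baseline"
}

# Sum of the three single-component gains versus the gain of the full model.
sum_single_gains = round(gains["AGIN"] + gains["CM"] + gains["CS"], 2)
full_gain = gains["DC-AGIN"]
```

The single-component gains are 1.14, 3.50, and 6.92 points, which sum to 11.56, while the full model gains 9.44 points over the baseline, consistent with partially overlapping effects.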
To evaluate the sensitivity of the proposed model to key hyperparameters, we further analyze the effects of the cross-modal contrastive temperature, the cross-subject supervised contrastive temperature (both denoted τ in Figure 6), and the corresponding loss weights (λ) on emotion recognition performance. Specifically, we conduct controlled experiments by varying each temperature and each loss weight over a set of candidate values. As shown in Figure 6, the performance varies moderately within a relatively wide parameter range, and the best results are obtained near the default configuration. In particular, when the cross-modal temperature is 0.1, the cross-subject temperature is 0.05, and the loss weights are 0.1, the model achieves the highest classification accuracy (62.56%), indicating that the proposed method exhibits reasonable robustness while benefiting from an appropriate choice of temperatures and loss-weight balancing. Further details are reported as follows:
- 1. Temperature analysis: The cross-modal temperature is used in cross-modal contrastive learning, where positive and negative pairs are drawn from different modalities. Because cross-modal representations typically exhibit larger distribution shifts and more dispersed similarities, a relatively larger temperature (0.1) is adopted to smooth the logits, preventing an overly sharp similarity distribution from amplifying modality noise and causing unstable gradient updates. In contrast, the cross-subject temperature is used in cross-subject supervised contrastive learning, where positive pairs satisfy the same-emotion, different-subject criterion. Within the shared fused representation space, the intra-class variations are comparatively controllable, yet stronger discriminative constraints are required to suppress subject-identity-related features. Therefore, a smaller temperature (0.05) is employed to strengthen the pull between same-class samples and the push between different-class samples.
- 2. Loss-weight analysis: The two loss weights are introduced to balance the relative importance between the contrastive objectives and the primary classification objective. In other words, they control the trade-off among cross-modal semantic alignment, cross-subject-identity invariance, and emotion discriminability. When a weight is too small, the corresponding contrastive loss is insufficient to mitigate modality discrepancies or inter-subject biases; when it is too large, the contrastive objectives may dominate optimization and inadvertently weaken task-relevant discriminative cues for classification. Consequently, an appropriate weight (0.1) provides a better balance for cross-subject emotion recognition.
Figure 6. Effects of the contrastive learning temperature coefficient τ and the loss weight λ on the subject-independent emotion recognition performance of DC-AGIN.
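The smoothing versus sharpening role of the temperature discussed above can be seen by applying a temperature-scaled softmax to a fixed set of similarities. The similarity values below are hypothetical and chosen only for illustration; the two temperatures are the ones reported for the cross-modal (0.1) and cross-subject (0.05) objectives.

```python
import math


def softmax_with_temperature(similarities, temperature):
    """Softmax over similarities divided by the temperature."""
    scaled = [s / temperature for s in similarities]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical cosine similarities between an anchor and three candidates.
similarities = [0.9, 0.7, 0.2]
p_smooth = softmax_with_temperature(similarities, temperature=0.1)
p_sharp = softmax_with_temperature(similarities, temperature=0.05)
```

With the smaller temperature, more probability mass concentrates on the most similar candidate, which corresponds to a stronger pull toward the closest samples and a stronger push away from the rest.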
To verify the effectiveness of EEG–fNIRS feature fusion and cross-modal contrastive learning, we conducted LOSO experiments using a single modality (EEG-only or fNIRS-only), and the results are reported in Table 5. The first two rows correspond to the uni-modal settings, where the model is trained and evaluated with only the AGIN encoder and the cross-subject contrastive learning module. As shown in the table, the multi-modal modeling of EEG and fNIRS (including cross-modal contrastive learning and feature fusion) is necessary: compared with uni-modal EEG and uni-modal fNIRS, the multi-modal model improves the recognition accuracy by 2.97% and 20.11%, respectively. Moreover, paired-sample t-tests against the corresponding methods in the first two rows of Table 5 yield p-values below the 0.05 significance threshold, indicating that the improvements are statistically significant. The multi-modal gain over uni-modal fNIRS is substantially larger, mainly because HbO and HbR concentration variations evolve relatively slowly and thus provide limited discriminative power when used alone for affective representation. Overall, these modality ablation results demonstrate that the proposed model can effectively fuse the complementary affective cues from EEG and fNIRS.
4.2.3. Visualization Analysis
To visually verify the feature learning capability of DC-AGIN, we performed a comparative t-SNE visualization between the raw multi-modal features at the network input and the high-level semantic features obtained after the dual-stream encoders and the dual-contrastive learning scheme. The t-SNE hyperparameters were set as follows: the perplexity was set to 30, which lies within the commonly recommended range of 5–50 and provides a good trade-off between local and global structure; the random seed was fixed to 42 to ensure reproducibility; and the number of iterations was set to 1000 to facilitate convergence. As shown in Figure 7, the original feature space in (a) exhibits a highly entangled distribution: samples from different emotion classes overlap substantially and lack separable cluster structures. This reflects that the raw EEG–fNIRS signals contain considerable background noise and that the heterogeneous modalities suffer from a large distribution gap, resulting in poor linear separability. In contrast, the learned feature space in (b) shows a clear structural improvement. Previously scattered samples begin to form bounded manifold-like clusters according to affective semantics (e.g., Sad in blue and Happy in red become increasingly separable), and both intra-class compactness and inter-class separability are markedly enhanced. Quantitatively, the raw features yield a Silhouette score of 0.089 and a Davies–Bouldin index (DBI) of 3.847, indicating that the four emotion classes are highly entangled in the original feature space. In contrast, the fused features extracted by DC-AGIN achieve a higher Silhouette score of 0.312 and a lower DBI of 2.236, reflecting improved clustering quality and providing quantitative evidence that the proposed method effectively enhances separability among emotion classes. 
This transition from disordered to structured distributions strongly indicates that DC-AGIN can effectively suppress low-level physiological noise and map heterogeneous signals into a unified high-dimensional discriminative space, thereby providing a solid representation foundation for subsequent high-accuracy emotion classification.
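The two clustering metrics quoted above follow standard definitions, which the sketch below implements in plain Python with Euclidean distance. It is not the authors' evaluation code, and the feature space on which they computed the scores is not restated here; in practice a library implementation such as scikit-learn's would normally be used.

```python
import math
from collections import defaultdict


def silhouette_score(points, labels):
    """Mean silhouette coefficient over all samples (higher is better)."""
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)
    scores = []
    for i, label in enumerate(labels):
        own = [j for j in clusters[label] if j != i]
        if not own:
            scores.append(0.0)  # convention for singleton clusters
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items() if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)


def davies_bouldin_index(points, labels):
    """Davies-Bouldin index (lower is better)."""
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    centroids = {
        label: [sum(col) / len(pts) for col in zip(*pts)]
        for label, pts in clusters.items()
    }
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in pts) / len(pts)
        for label, pts in clusters.items()
    }
    worst_ratios = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters if j != i
        )
        for i in clusters
    ]
    return sum(worst_ratios) / len(worst_ratios)
```

Compact, well-separated classes raise the silhouette score and lower the Davies-Bouldin index, which is the direction of change reported between the raw and learned features.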
Figure 8 presents the confusion matrices of DC-AGIN under the subject-dependent (a) and subject-independent (b) evaluation protocols. Panel (a) indicates that the model achieves near-perfect classification performance, suggesting that when the training set contains partial data from the same participants, the model can readily capture subject-specific physiological patterns and learn clear decision boundaries. However, this idealized setting may mask the critical inter-subject variability challenge in affective computing. More practically relevant is the subject-independent LOSO performance shown in (b). Despite the distribution shift across unseen subjects, the model still preserves satisfactory discriminability, with an average diagonal accuracy of approximately 62.9%.
By examining the off-diagonal misclassification patterns, we find that the confusions are not random. A prominent mutual confusion occurs between Happy and Fear, which can be attributed to their high similarity in arousal; consequently, it is difficult for the model to distinguish these two high-energy affective states solely based on the intensity of physiological activation. Another major confusion arises between Sad and Neutral. Sadness is often associated with suppressed neural activity or low-energy states, which resemble the baseline Neutral condition in both EEG power spectra and hemodynamic responses. As a result, some sadness samples with weak fluctuations tend to be classified as the most stable neutral-like state.
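The two quantities used in this analysis, the average diagonal accuracy and the dominant off-diagonal confusions, can be read off a confusion matrix as sketched below. The helper names are illustrative, rows are assumed to index the true class and columns the predicted class, and the example matrix is hypothetical rather than taken from Figure 8.

```python
def mean_diagonal_accuracy(confusion):
    """Average of per-class recall values (diagonal of the row-normalized matrix)."""
    recalls = [row[i] / sum(row) for i, row in enumerate(confusion)]
    return sum(recalls) / len(recalls)


def top_confusions(confusion, class_names, k=2):
    """Return the k largest off-diagonal cells as (true, predicted, count)."""
    size = len(confusion)
    pairs = [
        (class_names[t], class_names[p], confusion[t][p])
        for t in range(size) for p in range(size) if t != p
    ]
    return sorted(pairs, key=lambda item: item[2], reverse=True)[:k]


# Hypothetical counts for illustration only (rows: true class, columns: predicted).
classes = ["Sad", "Happy", "Neutral", "Fear"]
example = [
    [8, 1, 1, 0],
    [1, 6, 0, 3],
    [2, 0, 7, 1],
    [0, 4, 1, 5],
]
average_accuracy = mean_diagonal_accuracy(example)
largest_errors = top_confusions(example, classes)
```

Listing the largest off-diagonal cells makes structured confusions, such as a mutual Happy and Fear confusion, easy to distinguish from uniformly scattered errors.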
Figure 9 reports the per-subject emotion recognition performance under the LOSO protocol, where the blue and red curves correspond to the baseline method and the proposed DC-AGIN, respectively. It can be observed that both methods exhibit noticeable fluctuations across different subjects, indicating that cross-subject emotion recognition is still affected by individual variability and differences in signal quality. Nevertheless, DC-AGIN outperforms the GIN on the vast majority of subjects and yields a more stable overall trend, suggesting that the proposed cross-subject supervised contrastive constraint can effectively suppress identity-related noise and improve the robustness of generalization.
To investigate the physiological basis of the model’s decisions, we project the node-importance weights learned by the AGIN onto a two-dimensional channel-location distribution matrix. As shown in Figure 10, panel (a) provides the spatial position mapping matrix, and panel (b) visualizes the learned attention weights. The attention-weight visualization reflects a global pattern obtained by aggregating the attention weights of all test samples: we first compute the AGIN attention weight matrix for each sample and then average these matrices across all samples under the same setting to obtain a stable global attention distribution. The results clearly indicate that the prefrontal cortical region exhibits the most prominent activation (darkest color), while relatively high weights are also observed over the bilateral temporal areas. This distribution has clear neurophysiological implications: both the prefrontal and temporal lobes are known to be involved in emotion regulation and affect-related cognitive processing. These observations suggest that DC-AGIN can adaptively suppress background noise from less relevant regions and precisely focus on topologically informative nodes that are most discriminative for emotion recognition, thereby offering favorable neurophysiological interpretability.
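The aggregation step described above, averaging per-sample attention matrices into one global map, amounts to an element-wise mean. The sketch below assumes every sample yields a matrix of the same shape; the function name and the toy values are illustrative and not taken from the paper.

```python
def average_attention(per_sample_attention):
    """Element-wise mean of equally shaped per-sample attention matrices."""
    n_samples = len(per_sample_attention)
    n_rows = len(per_sample_attention[0])
    n_cols = len(per_sample_attention[0][0])
    return [
        [
            sum(sample[r][c] for sample in per_sample_attention) / n_samples
            for c in range(n_cols)
        ]
        for r in range(n_rows)
    ]


# Two hypothetical 2x2 attention matrices, for illustration only.
samples = [
    [[0.2, 0.8], [0.4, 0.6]],
    [[0.4, 0.6], [0.0, 1.0]],
]
global_attention = average_attention(samples)
```

Averaging over many test samples smooths out sample-specific fluctuations, so the resulting map reflects regions that are weighted consistently.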
5. Conclusions
In this work, we propose DC-AGIN, a novel multi-modal fusion framework tailored for EEG–fNIRS emotion recognition. The framework integrates an attention-enhanced graph neural encoder with a dual-contrastive learning mechanism to address two key challenges in multi-modal cross-subject emotion recognition: modality heterogeneity and inter-subject variability. Specifically, the AGIN is first employed to adaptively aggregate topological features from key brain regions. Building upon this, cross-modal contrastive learning aligns the temporal dynamics of EEG with the spatially informative characteristics of fNIRS in the representation space. More importantly, we design a cross-subject supervised contrastive learning objective to explicitly suppress subject-identity noise, enabling the model to learn subject-invariant affective representations. Extensive experiments on the TYUT3.0 dataset demonstrate that DC-AGIN achieves strong accuracy and robustness under both five-fold subject-dependent validation and LOSO subject-independent evaluation, reaching state-of-the-art performance on this dataset.
In addition to its classification performance, the learned attention weights highlight the prefrontal and temporal regions, which is consistent with the known involvement of these areas in emotion processing. Because DC-AGIN does not depend on subject-specific calibration data, it is a promising candidate for practical applications such as driver fatigue monitoring, mental stress assessment, and clinical diagnosis of affective disorders.
6. Discussion
Given the sensitivity of brain-signal data, this study places strong emphasis on data privacy and security and fully acknowledges the potential misuse risks of affective recognition technologies, such as unauthorized use, privacy infringement, and inappropriate decision support. Therefore, all data acquisition, storage, and processing in this work were conducted under strict ethical regulations and oversight, and the data were used solely for legitimate research purposes. Nevertheless, several limitations remain and warrant further investigation.
First, participants were primarily recruited among Chinese university students, leading to relatively homogeneous age and cultural backgrounds, which may limit the generalizability of the findings to broader populations. It should be noted that most publicly available EEG multi-modal emotion recognition datasets (e.g., SEED [31]) and related studies also predominantly involve university students or young adults, and their generalization ability may likewise be constrained. In addition, this study has not been independently validated on external public datasets; future work should incorporate more diverse samples and conduct systematic evaluations to more rigorously assess robustness and transferability.
Second, the EEG and fNIRS features used in this study are mainly segment-level static representations. While they can effectively support the current experimental paradigm and classification task, they do not explicitly model the temporal dynamics of affective states. Regarding feature construction, the fNIRS representation relies primarily on statistical descriptors, which may be insufficient to capture fine-grained hemodynamic variations and thus may limit the contribution of the fNIRS modality to affective discrimination. Future work will explore richer fNIRS features, such as time–frequency and wavelet-based dynamic descriptors, to enhance representational capacity. Moreover, incorporating dynamic feature extraction and temporal modeling frameworks may further improve the ability to characterize continuous affective changes in real-world scenarios.
In addition, the proposed framework has the potential to be extended to additional physiological modalities, such as ECG, GSR or EOG, which can provide complementary autonomic nervous system cues for affective states. When introducing new modalities, future studies may follow the unified graph + contrastive learning paradigm adopted in this work: each modality can be encoded into modality-specific representations and aligned into a shared semantic space via cross-modal contrastive learning and/or consistency constraints. Furthermore, more advanced fusion strategies can be explored to achieve more thorough multi-modal integration, thereby forming a plug-and-play and extensible multi-physiological affective recognition framework.
Finally, although the current model exhibits the potential for real-time inference on GPUs, training remains computationally intensive, and its applicability across different devices, noise conditions, and clinical populations requires further validation. Future work will investigate model compression and acceleration to reduce training and deployment costs and to develop an efficient, low-latency online affective BCI system. In parallel, we will explore more advanced domain adaptation or adversarial techniques to further improve the model’s adaptability in cross-subject and even cross-dataset affective recognition settings.
References
- 1. Afzal, S.; Khan, H.A.; Piran, M.J.; Lee, J.W. A comprehensive survey on affective computing: Challenges, trends, applications, and future directions. IEEE Access 2024, 12, 96150–96168. doi:10.1109/ACCESS.2024.3422480
- 2. Mühl, C.; Allison, B.; Nijholt, A.; Chanel, G. A survey of affective brain computer interfaces: Principles, state-of-the-art, and challenges. Brain-Comput. Interfaces 2014, 1, 66–84. doi:10.1080/2326263X.2014.912881
- 3. Kaliouby, R.E.; Picard, R.; Baron-Cohen, S. Affective computing and autism. Ann. N. Y. Acad. Sci. 2006, 1093, 228–248. doi:10.1196/annals.1382.016. PMID: 17312261
- 4. Koelstra, S.; Muhl, C.; Soleymani, M.; Lee, J.S.; Yazdani, A.; Ebrahimi, T.; Pun, T.; Nijholt, A.; Patras, I. DEAP: A database for emotion analysis; using physiological signals. IEEE Trans. Affect. Comput. 2011, 3, 18–31. doi:10.1109/T-AFFC.2011.15
- 5. Susanto, I.Y.; Pan, T.Y.; Chen, C.W.; Hu, M.C.; Cheng, W.H. Emotion recognition from galvanic skin response signal based on deep hybrid neural networks. In Proceedings of the 2020 International Conference on Multimedia Retrieval, Dublin, Ireland, 8–11 June 2020; pp. 341–345.
- 6. Wang, F.; Mao, M.; Duan, L.; Huang, Y.; Li, Z.; Zhu, C. Intersession instability in fNIRS-based emotion recognition. IEEE Trans. Neural Syst. Rehabil. Eng. 2018, 26, 1324–1333. doi:10.1109/TNSRE.2018.2842464. PMID: 29985142
- 7. Ramadan, M.A.; Salem, N.M.; Mahmoud, L.N.; Sadek, I. Multimodal machine learning approach for emotion recognition using physiological signals. Biomed. Signal Process. Control 2024, 96, 166553. doi:10.1016/j.bspc.2024.106553
- 8. Codina, T.; Blankertz, B.; Lühmann, A.V. Multimodal fNIRS–EEG sensor fusion: Review of data-driven methods and perspective for naturalistic brain imaging. Imaging Neurosci. 2025, 3. doi:10.1162/IMAG.a.974. PMCID: PMC12592382. PMID: 41211102
