DMMG: Dual Min-Max Games for Self-Supervised Skeleton-Based Action Recognition
Shannan Guan, Xin Yu, Wei Huang, Gengfa Fang, Haiyan Lu

TL;DR
This paper introduces DMMG, a self-supervised learning approach for skeleton-based action recognition that uses dual adversarial min-max games to generate challenging contrastive pairs through viewpoint variation and edge perturbation.
Contribution
The paper presents a novel dual min-max game framework that enhances self-supervised skeleton action recognition by generating diverse, hard contrastive samples via viewpoint and edge perturbation strategies.
Findings
Achieves superior accuracy on NTU-RGB+D datasets.
Effectively captures discriminative action features.
Outperforms existing self-supervised methods.
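
Table I: Comparison with contrastive learning-based methods that use an ST-GCN backbone. Each Δ column gives the accuracy of the named DMMG variant minus the accuracy of the method in that row.

| Method | Stream | NTU-60 Xsub Acc. (%) | Δ DMMG | Δ DMMG (pre-trained on NTU-61-120) | NTU-60 Xview Acc. (%) | Δ DMMG | Δ DMMG (pre-trained on NTU-61-120) | NTU-120 Xsub Acc. (%) | Δ DMMG | NTU-120 Xset Acc. (%) | Δ DMMG |
|---|---|---|---|---|---|---|---|---|---|---|---|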
| SkeletonCLR [16] | J | 68.3 | +13.8 | +10.9 | 76.4 | +10.7 | +10.2 | 56.8 | +12.8 | 55.9 | +14.2 |
| CrosSCLR [16] | J | 72.9 | +9.2 | +6.3 | 79.9 | +7.2 | +6.7 | 61.3 | +8.3 | 60.5 | +9.6 |
| AimCLR [15] | J | 74.3 | +7.8 | +4.9 | 79.7 | +7.4 | +6.9 | 63.4 | +6.2 | 63.4 | +6.7 |
| SkeleMixCLR [17] | J | 79.6 | +2.5 | -0.4 | 84.4 | +2.7 | +2.2 | 67.4 | +2.2 | 69.6 | +0.5 |
| DMMG (Ours, pre-trained on NTU-61-120) | J | 79.2 | +2.9 | - | 86.6 | +0.5 | - | - | - | - | - |
| DMMG (Ours) | J | 82.1 | - | -2.9 | 87.1 | - | -0.5 | 69.6 | - | 70.1 | - |
| SkeletonCLR [16] | M | 53.3 | +23.4 | +19.6 | 50.8 | +30.2 | +28.9 | 39.6 | +23.7 | 40.2 | +21.9 |
| CrosSCLR [16] | M | 72.5 | +4.2 | +0.4 | 77.6 | +3.4 | +2.1 | 59.4 | +3.9 | 59.2 | +2.9 |
| AimCLR [15] | M | 66.8 | +9.9 | +6.1 | 70.6 | +10.4 | +9.1 | 57.3 | +6.0 | 54.4 | +7.7 |
| SkeleMixCLR [17] | M | 70.3 | +6.4 | +2.6 | 76.1 | +4.9 | +3.6 | 49.7 | +13.6 | 53.8 | +8.3 |
| DMMG (Ours, pre-trained on NTU-61-120) | M | 72.9 | +3.8 | - | 79.7 | +1.3 | - | - | - | - | - |
| DMMG (Ours) | M | 76.7 | - | -3.8 | 81.0 | - | -1.3 | 63.3 | - | 62.1 | - |
| 2s-SkeletonCLR [16] | J+M | 75.0 | +9.2 | +7.5 | 79.8 | +9.5 | +9.2 | 60.7 | +12.0 | 62.6 | +9.8 |
| 2s-CrosSCLR [16] | J+M | 77.8 | +6.4 | +4.7 | 83.4 | +5.9 | +5.6 | 66.7 | +6.0 | 65.1 | +7.3 |
| 3s-AimCLR [15] | J+M+B | 78.9 | +5.3 | +3.6 | 83.8 | +5.5 | +5.2 | 68.2 | +4.5 | 68.8 | +3.6 |
| 3s-SkeleMixCLR [17] | J+M+B | 81.0 | +3.2 | +1.5 | 85.6 | +3.7 | +3.4 | 69.1 | +3.6 | 69.9 | +2.5 |
| 2s-DMMG (Ours, pre-trained on NTU-61-120) | J+M | 82.5 | +1.7 | - | 89.0 | +0.3 | - | - | - | - | - |
| 2s-DMMG (Ours) | J+M | 84.2 | - | -1.7 | 89.3 | - | -0.3 | 72.7 | - | 72.4 | - |

Table II: Linear evaluation results on NTU-60.

| Method | Backbone | Xsub (%) | Xview (%) |
|---|---|---|---|
| Single-stream | |||
| LongT GAN [38] | GRU | 52.1 | 56.4 |
| MS2L [39] | GRU | 52.6 | - |
| PCRP [50] | RNN | 53.9 | 63.5 |
| AS-CAL [51] | LSTM | 58.5 | 64.8 |
| P&C [34] | RNN | 50.7 | 76.3 |
| CRRL [52] | GRU | 67.6 | 73.8 |
| ISC [53] | GCN | 76.3 | 85.2 |
| SkeleMixCLR+ [17] | GCN | 80.7 | 85.5 |
| DMMG (Ours) | GCN | 82.1 | 87.1 |
| Multi-stream | |||
| 3s-CrosSCLR (LSTM) [16] | LSTM | 62.8 | 69.2 |
| 3s-SkeletonCLR [16] | GCN | 75.0 | 79.8 |
| 3s-Colorization [33] | GCN | 75.2 | 83.1 |
| 3s-CrosSCLR [16] | GCN | 72.8 | 80.7 |
| 3s-CrosSCLR [16] | GCN | 77.8 | 83.4 |
| 3s-SkeleMixCLR+ [17] | GCN | 82.7 | 87.1 |
| 2s-DMMG (Ours) | GCN | 84.2 | 89.3 |

Table III: Linear evaluation results on NTU-120.

| Method | Backbone | Xsub (%) | Xset (%) |
|---|---|---|---|
| Single-stream | |||
| LongT GAN [38] | GRU | 35.6 | 39.7 |
| P&C [34] | RNN | 42.7 | 41.7 |
| CRRL [52] | GRU | 56.2 | 57.0 |
| PCRP [50] | RNN | 41.7 | 45.1 |
| AS-CAL [51] | LSTM | 48.6 | 49.2 |
| ISC [53] | GCN | 67.9 | 67.1 |
| SkeleMixCLR+ [17] | GCN | 69.0 | 68.2 |
| DMMG (Ours) | GCN | 69.6 | 70.1 |
| Multi-stream | |||
| 3s-CrosSCLR (LSTM) [16] | LSTM | 53.9 | 53.2 |
| 3s-SkeletonCLR [16] | GCN | 60.7 | 62.6 |
| 3s-CrosSCLR [16] | GCN | 67.9 | 66.7 |
| 3s-SkeleMixCLR+ [17] | GCN | 70.5 | 70.7 |
| 2s-DMMG (Ours) | GCN | 72.7 | 72.4 |

Table IV: Semi-supervised evaluation results on NTU-60.

| Method | Label Fraction | NTU-60 Xsub (%) | NTU-60 Xview (%) |
|---|---|---|---|
| LongT GAN [38] | 1% | 33.1 | - |
| MS2L [39] | 1% | 35.2 | - |
| ISC [53] | 1% | 35.7 | 38.1 |
| 3s-CrosSCLR [16] | 1% | 51.1 | 50.0 |
| 3s-Colorization [33] | 1% | 48.3 | 52.5 |
| 3s-AimCLR [15] | 1% | 54.8 | 54.3 |
| 3s-SkeleMixCLR [17] | 1% | 55.3 | 55.7 |
| 2s-DMMG (Ours) | 1% | 56.1 | 56.6 |
| LongT GAN [38] | 10% | 62.0 | - |
| MS2L [39] | 10% | 65.2 | - |
| ISC [53] | 10% | 65.9 | 72.5 |
| 3s-CrosSCLR [16] | 10% | 74.4 | 77.8 |
| 3s-Colorization [33] | 10% | 71.7 | 78.9 |
| 3s-AimCLR [15] | 10% | 78.2 | 81.6 |
| 3s-SkeleMixCLR [17] | 10% | 79.9 | 83.6 |
| 2s-DMMG (Ours) | 10% | 81.8 | 85.1 |

Table V: Finetune protocol results on NTU-60 and NTU-120.

| Method | Evaluation | NTU-60 Xsub (%) | NTU-60 Xview (%) | NTU-120 Xsub (%) | NTU-120 Xset (%) |
|---|---|---|---|---|---|
| 2s-STGCN [44] | Fully-Supervised | 85.0 | 91.2 | 77.0 | 77.2 |
| 2s-ASGCN [5] | Fully-Supervised | 88.5 | 95.1 | 80.5 | 82.6 |
| 2s-DMMG (STGCN) | Linear | 84.2 | 89.3 | 72.7 | 72.4 |
| 2s-DMMG (ASGCN) | Linear | 86.1 | 92.7 | 76.2 | 78.9 |
| 2s-DMMG (STGCN) | Finetune | 87.9 | 94.2 | 82.4 | 83.0 |
| 2s-DMMG (ASGCN) | Finetune | 88.7 | 95.2 | 83.3 | 84.2 |
Taxonomy
Topics: Human Pose and Action Recognition · Anomaly Detection Techniques and Applications · Gait Recognition and Analysis
Methods: Contrastive Learning
DMMG: Dual Min-Max Games for Self-Supervised Skeleton-Based Action Recognition
Shannan Guan¹, Xin Yu², Wei Huang³, Gengfa Fang⁴, and Haiyan Lu¹
¹ Australia Artificial Intelligence Institute, University of Technology Sydney, Australia
² School of Information Technology and Electrical Engineering, University of Queensland, Australia
³ RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
⁴ School of Electrical and Data Engineering, University of Technology Sydney, Australia
Abstract
In this work, we propose a new Dual Min-Max Games (DMMG) based self-supervised skeleton action recognition method by augmenting unlabeled data in a contrastive learning framework. Our DMMG consists of a viewpoint variation min-max game and an edge perturbation min-max game. These two min-max games adopt an adversarial paradigm to perform data augmentation on the skeleton sequences and graph-structured body joints, respectively. Our viewpoint variation min-max game focuses on constructing various hard contrastive pairs by generating skeleton sequences from various viewpoints. These hard contrastive pairs help our model learn representative action features, thus facilitating model transfer to downstream tasks. Moreover, our edge perturbation min-max game specializes in building diverse hard contrastive samples by perturbing the connectivity strength among graph-based body joints. The connectivity-strength varying contrastive pairs enable the model to capture the minimal sufficient information of different actions, such as representative gestures for an action, while preventing the model from overfitting. By fully exploiting the proposed DMMG, we can generate sufficiently challenging contrastive pairs and thus achieve discriminative action feature representations from unlabeled skeleton data in a self-supervised manner. Extensive experiments demonstrate that our method achieves superior results under various evaluation protocols on the widely used NTU-RGB+D and NTU120-RGB+D datasets.
Index Terms:
Self-supervised learning, adversarial learning, contrastive learning, skeleton action recognition, min-max game.
I Introduction
Skeleton-based human action recognition has attracted researchers' attention for decades [1]. With the advance of depth cameras (e.g., the Kinect sensor) and robust 3D pose estimation methods [2, 3, 4], 3D skeleton data has become more accessible. In the past few years, many fully-supervised skeleton-based action recognition methods [5, 6, 7, 8, 9, 10, 11] have achieved high accuracy through carefully designed models. However, fully-supervised methods rely heavily on large amounts of annotated skeleton data, and annotating data is costly and time-consuming. To compensate for the lack of annotated skeleton data, some recent action recognition works [12, 13, 14, 15, 16, 17] have explored self-supervised approaches.
In the self-supervised skeleton-based action recognition task, contrastive learning [18, 19] has been introduced to learn feature representations. Those self-supervised action recognition methods [15, 16, 20, 17] first augment skeleton data to construct contrastive pairs (i.e., positive and negative pairs). Then, they learn action features by increasing the similarity between positive pairs while decreasing the similarity between negative pairs using a contrastive loss [21, 22, 23].
Although those self-supervised action recognition methods achieve promising results, they still face the following challenges: (1) Their data augmentation strategies rely on random mechanisms, so they may not generate sufficient hard contrastive pairs to learn representative action features [24, 25, 26]. (2) Existing skeleton data augmentation strategies [15, 16, 20] may construct misleading positive pairs, and these pairs can cause ambiguity in model learning. For example, the crop operation may slice the sit-down action into two parts, a sit sequence and a stand sequence, and using these two sequences as a positive pair would harm model learning. (3) Several methods may not focus on the most discriminative and representative features [27]. For example, many people habitually swing their arms while walking, but arm swinging may not best represent walking actions. How to effectively capture the most representative motions for action recognition remains challenging.
In this paper, we propose a viewpoint variation min-max game and an edge perturbation min-max game, and we combine them as Dual Min-Max Games (DMMG) to address the aforementioned three challenges. Our two min-max games are built on an adversarial paradigm to perform data augmentation on the skeleton sequences and graph-structured body joints, respectively. By fully exploiting our DMMG, we can construct a large number of sufficiently challenging contrastive pairs, and we leverage these hard contrastive pairs to learn more representative and discriminative action feature representations. In this fashion, we can improve the model performance on downstream tasks.
To be specific, our viewpoint variation min-max game augments skeleton sequences from various viewpoints. We first try to find various viewpoints to increase the visual differences for an identical 3D skeleton sequence. Then, we maximize the similarity between the feature representations of those skeleton sequences under different viewpoints. As our viewpoint variation min-max game only finds a new viewpoint to render original skeletons, we can preserve the structural and temporal consistency of the skeleton sequences during the data augmentation. In this way, our constructed positive pairs do not have ambiguity and they will facilitate learning distinctive action features.
Our edge perturbation min-max game augments graph-based body joints to construct connectivity-strength varying contrastive pairs by perturbing their connectivity strengths. It minimizes the correspondence between the original and augmented graph-based body joints by reducing the connectivity strengths and then maximizes the similarity between their feature representations [28, 29, 30]. For example, we can augment graph-based body joints of a walking action with minimal connectivity strengths among arm-related joints while preserving the connectivity strengths among the leg-related joints. In this manner, important joint connections will be highlighted to represent an action. Therefore, our edge perturbation min-max game will learn to identify representative joint connections for action recognition. Moreover, thanks to these connectivity-strength varying contrastive pairs, we can better prevent overfitting.
As shown in Fig. 1, the action features represented by our DMMG are more discriminative than other methods. Moreover, extensive experimental results on NTU-RGB+D datasets [31, 32] demonstrate that our DMMG boosts the model’s performance on various downstream tasks. Our main contributions can be summarised as follows:
- We propose a novel Dual Min-Max Games (DMMG) under an adversarial paradigm for self-supervised skeleton action recognition in a contrastive learning framework.
- Our DMMG performs a challenging data augmentation on the skeleton sequences and graph-structured body joints. It constructs a large number of unambiguous hard contrastive pairs in learning discriminative action features.
- Our DMMG constructs connectivity-strength varying contrastive pairs and allows us to capture pivotal representations of actions while avoiding overfitting issues.
- Our DMMG achieves state-of-the-art performance under various evaluation protocols on two benchmark datasets: NTU RGB+D 60 and NTU RGB+D 120.
II Related Work
II-A Self-Supervised Skeleton-Based Action Recognition
Self-supervised skeleton-based action recognition aims to learn action feature representations from numerous unlabeled 3D skeleton sequences. Due to the lack of label supervision, self-supervised learning generally requires an intra-supervision signal derived from unlabeled data [16]. In the early stage, some methods generate supervision signals by designing pretext tasks, such as skeleton sequence reconstruction [12], colorization [33], autoregression [13, 34], motion prediction [35], and jigsaw puzzles [36, 37]. To explore more useful supervision signals, some methods reconstruct skeleton sequences by using a generative adversarial network (GAN). For example, LongT GAN [38] proposes an auto-encoder-based GAN to reconstruct the sequential information of skeletons. P&C [34] utilizes an encoder-decoder framework to reconstruct a skeleton sequence and learn action features. Colorization [33] designs a colorized point cloud to represent skeleton data and utilizes an auto-encoder framework to learn spatio-temporal features from these hand-crafted colorized skeleton joints. MS2L [39] develops a multi-task learning framework by combining contrastive learning with pretext tasks. However, these methods often rely on pretext tasks and may not generalize well to downstream tasks.
Recently, the contrastive learning mechanism has demonstrated promising performance in self-supervised action recognition. The supervision signals of the contrastive learning paradigm are usually generated by a contrastive loss [19], such as InfoNCE [22], SimCLR [40], MoCo [21], and OTM [23]. These contrastive losses have been widely used in recent self-supervised action recognition methods. For example, CrosSCLR [16] and SkeleMixCLR [17] achieve promising performance by adopting MoCo v2 [41]. AimCLR [15] adopts InfoNCE [22] as a contrastive loss to improve the model performance on downstream tasks. In this paper, we use OTM [23] to implement our DMMG.
II-B Skeleton Data Augmentation Strategy
Constructing contrastive pairs via skeleton data augmentation is a critical component in self-supervised action recognition tasks under a contrastive learning framework. Most contrastive learning-based self-supervised action recognition methods focus on developing various data augmentation strategies to construct hard contrastive pairs, since harder contrastive pairs promote learning more representative action features. For example, MCC [20] adopts a speed-changed operation combined with a random start frame to build hard positive samples. CrosSCLR [16] adopts Shear and Crop data augmentation, which have become the most commonly used skeleton data augmentation strategies in self-supervised action recognition works [16, 17, 15].
To explore more discriminative action feature representations, AimCLR [15] adopts more hand-crafted data augmentation approaches, such as Spatial Flipping, Rotation, Axis Masking, and Temporal Flipping, to construct hard contrastive pairs. Furthermore, SkeleMixCLR [17] develops a stronger skeleton data augmentation method by randomly mixing different body parts (e.g., left hands, right legs, trunk) among different skeleton sequences. Although these hand-crafted data augmentation methods can create effective contrastive pairs, their random data augmentation strategies may not construct sufficiently challenging contrastive pairs. In contrast, our DMMG adopts an adversarial learning technique to optimize the data augmentation strategy and thus can construct a large number of non-misleading yet challenging contrastive pairs.
III Preliminary
In this section, we introduce the preliminary concepts and definitions of our DMMG. Here, we denote a 3D skeleton sequence as , which contains frames with joints, and each joint has dimensions, i.e., coordinates. Then, we represent a graph structure , where is a node and is an edge. The node features are 3D coordinates of skeleton joints, and we use an adjacency matrix to represent graph-structured body joints. The adjacency matrix is denoted by , whose element associates with the edge . The conventional GCN-based models generally only adopt the node features as input, and the adjacency matrix is applied to hidden layers. To implement our DMMG, we modify the GCN-based model which can take both node features and adjacency matrix as inputs, formulated as . We aim to learn a model for further use in downstream tasks.
III-A Viewpoint Variation Min-Max Game
In this viewpoint variation min-max game, we first define a skeleton augmenter . takes as inputs and generates . The minimization target is to minimize the mutual information [42] between the feature representations and by training the skeleton augmenter . It can be formulated as:
[TABLE]
where denotes the mutual information between two feature representations.
This maximization game is to maximize the mutual information between the feature representations and by training the model , and it can be expressed as follows:
[TABLE]
Combining Eq. (1) and Eq. (2), our viewpoint variation min-max game is formulated as:
[TABLE]
III-B Edge Perturbation Min-Max Game
In the edge perturbation min-max game, we first present a learnable graph augmenter with inputs and . For an adjacency matrix , denotes the perturbed adjacency matrix, where remains the same connection but has different weights for the edge . Here, the minimization game aims to minimize the mutual information between the feature representations and by training the graph augmenter , defined by:
[TABLE]
The maximization game is to maximize the mutual information between the feature representations and by training the model , formulated as:
[TABLE]
Combining Eq. (4) and Eq. (5), our edge perturbation min-max game is defined as:
[TABLE]
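The equations of this section did not survive extraction, so the following is only a sketch of the two objectives as the surrounding text describes them. All symbols here are assumed for illustration and are not the paper's own notation: $X$ is a skeleton sequence, $A$ its adjacency matrix, $f$ the model, $T_\theta$ the skeleton augmenter, $G_\phi$ the graph augmenter, and $I(\cdot\,;\cdot)$ the mutual information between two feature representations.

```latex
% Viewpoint variation min-max game (assumed notation):
% the augmenter T_theta minimizes, the model f maximizes.
\min_{\theta} \max_{f} \; I\big( f(X, A) \,;\, f(T_\theta(X), A) \big)

% Edge perturbation min-max game (assumed notation):
% the graph augmenter G_phi perturbs A given X and A.
\min_{\phi} \max_{f} \; I\big( f(X, A) \,;\, f(X, G_\phi(X, A)) \big)
```

In both games the augmenter and the model are adversaries over the same quantity, which is what drives the augmenter toward hard positive pairs.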
IV Methodology
In this section, we first introduce the data streams and model used in our DMMG, and then present the details of our two min-max games (e.g., the details of key modules, principles and algorithms) as well as the contrastive learning framework to implement our DMMG.
IV-A Model Details
IV-A1 Data Stream of DMMG
Skeleton-based data can be easily converted into various data streams [43]. We utilize the joint and motion streams as inputs. The joint stream is noted as and the motion stream is represented as the joint displacement between frames: . For constructing contrastive pairs, we consider as a positive pair in our viewpoint variation min-max game, and any other samples are regarded as negative pairs. In our edge perturbation min-max game, we define as a positive pair, and any other samples serve as negative pairs.
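As an illustration of the motion stream, the following minimal sketch computes frame-to-frame joint displacements. The function name, the direction of the difference (next frame minus current frame), and the zero padding of the last frame are assumptions made for this example, since the exact formula is not recoverable from the text.

```python
def motion_stream(joints):
    """Compute a motion stream from a joint stream.

    joints: list of T frames, each a list of V joints, each joint [x, y, z].
    Returns T frames of displacements (next frame minus current frame).
    The last frame is zero-padded so both streams have the same length;
    this padding is an assumption, not something stated in the paper.
    """
    if not joints:
        return []
    motion = []
    for t in range(len(joints) - 1):
        frame = [
            [nxt - cur for cur, nxt in zip(joint_now, joint_next)]
            for joint_now, joint_next in zip(joints[t], joints[t + 1])
        ]
        motion.append(frame)
    # Pad with zeros for the final frame, which has no successor.
    motion.append([[0.0] * len(joint) for joint in joints[-1]])
    return motion


# Usage: three frames, two joints per frame.
sequence = [
    [[0, 0, 0], [1, 1, 1]],
    [[1, 0, 0], [1, 2, 1]],
    [[1, 0, 2], [0, 2, 1]],
]
motion = motion_stream(sequence)
```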
IV-A2 Model
In this work, we employ ST-GCN [44] as the model. ST-GCN can explore the spatio-temporal information in 3D skeleton sequences, and it is widely used in self-supervised 3D skeleton action recognition methods [16, 17, 15]. The GCN based model embeds with the pre-defined adjacency matrix into a hidden space . In particular, the GCNs can be represented with a simple form [45]:
[TABLE]
where ; is the output feature from the -th GCN layer, when and ; denotes a nonlinear function, i.e., the ReLU function; is the weight matrix in the -th GCN layer. We can substitute the original with augmented , and Eq. (7) can be modified as:
[TABLE]
The output features can be extracted by several GCN layers, represented as . Then, the GCN model is followed by an average pooling operation on the spatio-temporal dimension, and the action feature is finally formulated as .
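To make the layer computation concrete, here is a minimal pure-Python sketch of one spatial graph convolution followed by average pooling. It assumes the "simple form" refers to the widely used rule ReLU(Â H W) with a symmetrically normalized adjacency matrix with self-loops; that normalization, and all function names, are assumptions of this sketch. It also ignores the temporal convolutions of ST-GCN. Because the adjacency matrix is an explicit argument, a perturbed matrix can be passed in place of the original one.

```python
import math


def normalize_adjacency(adj):
    """Return D^{-1/2} (A + I) D^{-1/2} for a square, non-negative matrix."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]  # >= 1 because of the self-loop
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Plain matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(adj, feats, weight):
    """One graph convolution: ReLU(A_norm @ H @ W)."""
    out = matmul(matmul(normalize_adjacency(adj), feats), weight)
    return [[max(0.0, v) for v in row] for row in out]


def average_pool(feats):
    """Average node features over all joints into one feature vector."""
    n = len(feats)
    return [sum(col) / n for col in zip(*feats)]


# Usage: two joints connected by one edge, 2-D node features.
adjacency = [[0.0, 1.0], [1.0, 0.0]]
node_feats = [[1.0, 0.0], [0.0, 1.0]]
weights = [[1.0, -1.0], [0.0, 1.0]]
action_feature = average_pool(gcn_layer(adjacency, node_feats, weights))
```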
IV-B Dual Min-Max Games
IV-B1 Viewpoint Variation Min-Max Game
As shown in Fig. 3, in the skeleton augmenter , a skeleton sequence is first fed into a multi-layer perceptron (MLP): with trainable parameters , which outputs a normalized quaternion [46]. Then, is used for rotating the skeleton data and generating the augmented skeleton data . This quaternion rotation operation has the same effect as changing the viewpoint. We denote such a quaternion rotation operation as and the augmented skeleton sequence is generated by . Note that in the viewpoint variation min-max game, we use the original adjacency matrix for the original and augmented skeleton data to construct the contrastive pair: . They are used as inputs for the GCN model to learn action features. In the minimization game, we train the skeleton augmenter to generate harder contrastive pairs by minimizing the mutual information between the original and augmented action features. Note that the parameters of the GCN model are fixed during this step. Eq. (1) can be modified as:
[TABLE]
In the maximization game, we train the GCN model to maximize the mutual information between the hard contrastive pairs. The parameter of is fixed in the maximization game, and Eq. (2) is revised as:
[TABLE]
Combining Eq. (9) with Eq. (10), our viewpoint variation min-max game is formally represented as:
[TABLE]
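The rotation step of the skeleton augmenter can be sketched as follows. In the method the quaternion is predicted by an MLP; here it is simply passed in, and the function names and the (w, x, y, z) component order are assumptions of this example. Because a rotation is a rigid transform, bone lengths and temporal order are preserved, which is why the augmented sequence remains an unambiguous positive sample.

```python
import math


def normalize_quaternion(q):
    """Scale a quaternion (w, x, y, z) to unit length."""
    norm = math.sqrt(sum(c * c for c in q))
    return [c / norm for c in q]


def quaternion_to_matrix(q):
    """Convert a quaternion (w, x, y, z) into a 3x3 rotation matrix."""
    w, x, y, z = normalize_quaternion(q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def rotate_sequence(seq, q):
    """Rotate every joint of every frame by the same quaternion.

    seq: list of frames, each a list of joints [x, y, z].
    """
    rot = quaternion_to_matrix(q)
    return [
        [[sum(r * c for r, c in zip(row, joint)) for row in rot]
         for joint in frame]
        for frame in seq
    ]


# Usage: rotate a one-frame, two-joint skeleton by 90 degrees about the z axis.
half = math.pi / 4
q_z90 = [math.cos(half), 0.0, 0.0, math.sin(half)]
rotated = rotate_sequence([[[1.0, 0.0, 0.0], [0.0, 0.0, 2.0]]], q_z90)
```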
IV-B2 Edge Perturbation Min-Max Game
As shown in Fig. 4, the graph augmenter takes and as inputs. First, is fed into an MLP: with trainable parameters , which outputs a vector , where is the number of pre-defined edges in the adjacency matrix and is restricted to the range from 0 to 1 by a sigmoid activation function. Then, is multiplied with all non-zero elements in . Here, we denote this element-wise multiplication as , and thus we can obtain augmented graph-structured body joints by . In the minimization game, we train the graph augmenter to generate more challenging contrastive pairs from the graph-structured data perspective. We minimize the mutual information between the original and augmented graph-structured body joints embedded by the GCN model , where the parameters of the model are fixed. We revise Eq. (4) as:
[TABLE]
In the maximization game, we train the model to maximize the mutual information between hard contrastive pairs. Different from the minimization game, we fix the parameter of and Eq. (5) is revised as:
[TABLE]
Combining Eq. (12) and Eq. (13), we finally formulate our edge perturbation min-max game as:
[TABLE]
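The perturbation step of the graph augmenter can be sketched in a few lines. In the method the per-edge scores come from an MLP; here raw scores are passed in directly. The function names, and the choice to treat each non-zero entry in row-major order as one edge, are assumptions of this example.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def perturb_adjacency(adj, logits):
    """Scale each non-zero entry of adj by sigmoid(logit).

    Zero entries stay zero, so no connection is added or removed;
    only the connectivity strength of existing edges changes.
    """
    edges = [(i, j) for i, row in enumerate(adj)
             for j, v in enumerate(row) if v != 0]
    if len(logits) != len(edges):
        raise ValueError("expected one logit per non-zero adjacency entry")
    out = [[float(v) for v in row] for row in adj]
    for (i, j), logit in zip(edges, logits):
        out[i][j] = adj[i][j] * sigmoid(logit)
    return out


# Usage: weaken the first edge, keep the second nearly intact.
perturbed = perturb_adjacency([[0, 1], [1, 0]], [0.0, 10.0])
```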
IV-C Learning Framework of DMMG
As shown in Fig. 2, our minimization games first lead the data augmentation modules to generate harder contrastive samples by minimizing the mutual information between the learned features from the original and augmented data. Then, in our maximization games, we utilize a contrastive loss to drive the model to learn feature representations by pulling the positive pairs closer and forcing the negative pairs away in embedding space. From the maximization games, as shown in Fig. 3 and Fig. 4, the two models and encode the original data and augmented data, where the parameter of in is momentum updated [21] of , and is the momentum coefficient. Then, a projector and its momentum updated version project the hidden spaces into a lower dimension space: and , where the projector is a fully connected layer with a ReLU activation function.
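The momentum update mentioned above can be illustrated as follows, assuming the MoCo-style rule in which the momentum encoder's parameters are an exponential moving average of the trained encoder's parameters. The function name, the flat-list parameter representation, and the default coefficient are illustrative assumptions; the paper's coefficient value is not recoverable from the text.

```python
def momentum_update(key_params, query_params, m=0.999):
    """Return updated momentum-encoder parameters.

    key_params:   current parameters of the momentum encoder.
    query_params: current parameters of the encoder trained by backprop.
    m:            momentum coefficient (0.999 is an illustrative value).
    """
    return [m * k + (1.0 - m) * q for k, q in zip(key_params, query_params)]


# Usage: the momentum encoder drifts slowly toward the trained encoder.
updated = momentum_update([1.0, 0.0], [0.0, 1.0], m=0.9)
```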
In Eq. (11) and Eq. (14), we utilize the term mutual information [42] to evaluate the similarity between different features. To estimate the similarity, we adopt the online triplet mining loss [47] as the estimator, which is frequently used in contrastive learning tasks [47, 48, 49]. During the training process, we set a minibatch of samples, let and , where and denote the original and augmented data respectively. Here, we define the feature learned by any original data as the anchor sample, the augmented feature as the positive sample, and any other original feature as the negative sample.
[TABLE]
where is the margin between positive and negative pairs, and denotes the standard hinge loss [23]. In general, computing the triplet loss on hard contrastive pairs significantly improves model generalization on downstream tasks [47]. Therefore, during training, we select hard positive sample features and hard negative sample features from a minibatch by computing and . In the minimization games, we calculate the minimizing information loss by . In the maximization games, we define the maximizing information loss as , where is the hyperparameter for regularizing the loss value.
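A minimal sketch of a triplet loss with hard negative mining is given below. It follows the roles described above (original feature as anchor, its augmented feature as positive, other original features as negatives), but the Euclidean distance, the choice of the closest negative as the hard negative, the margin value, and the function names are assumptions of this example, since the exact loss did not survive extraction.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def hard_triplet_loss(originals, augmented, margin=0.5):
    """Mean hinge loss max(0, d_pos - d_neg + margin) over a minibatch.

    originals[i] is the anchor, augmented[i] its positive, and the
    closest other original feature is used as the hard negative.
    Requires a minibatch of at least two samples.
    """
    losses = []
    for i, anchor in enumerate(originals):
        d_pos = euclidean(anchor, augmented[i])
        d_neg = min(euclidean(anchor, other)
                    for j, other in enumerate(originals) if j != i)
        losses.append(max(0.0, d_pos - d_neg + margin))
    return sum(losses) / len(losses)


# Usage: the first anchor's positive is farther away than its negative,
# so that triplet is penalized.
loss = hard_triplet_loss([[0.0, 0.0], [1.0, 0.0]], [[0.0, 2.0], [1.0, 0.0]])
```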
In the training stage, we run our two min-max games alternately on the same minibatch. For example, we first run the viewpoint variation min-max game and then the edge perturbation min-max game.
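The alternating schedule can be sketched as a small driver that runs each game's minimization step and then its maximization step on the same minibatch. The function name and the callback interface are hypothetical; in a real implementation each callback would compute a loss and step the optimizer of the augmenter or of the model while the other is frozen.

```python
def train_minibatch(batch, games):
    """Run every min-max game once on the same minibatch.

    games: ordered list of (name, min_step, max_step) tuples, where
    min_step updates the augmenter with the model frozen and max_step
    updates the model with the augmenter frozen. Each step takes the
    batch and returns its loss value.
    """
    log = []
    for name, min_step, max_step in games:
        log.append((name, "min", min_step(batch)))
        log.append((name, "max", max_step(batch)))
    return log


# Usage with placeholder steps that only return dummy loss values.
history = train_minibatch(
    batch=[0, 1, 2],
    games=[
        ("viewpoint", lambda b: 0.4, lambda b: 0.3),
        ("edge", lambda b: 0.2, lambda b: 0.1),
    ],
)
```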
V Experiments
To evaluate the effectiveness of our DMMG, we conduct experiments on two benchmark 3D skeleton-based action recognition datasets: NTU RGB+D 60 [31] and NTU RGB+D 61-120 [32]. We train the model in the PyTorch framework with a minibatch size of 32, using an SGD optimizer with an initial learning rate of 0.1, a weight decay of 0.0001, and a momentum of 0.9.
V-A Datasets
- NTU RGB+D 60 (NTU-60) [31]: A large-scale dataset for 3D skeleton-based action recognition, recorded by Kinect V2 sensors, in which each skeleton graph is depicted by 25 joints. This dataset contains 56,880 3D skeleton sequences from 40 different performers. These sequences cover 60 daily actions, including single-person actions, human-object interactions, and human-human interactions. We evaluate our methods with two evaluation metrics: cross-subject (Xsub: skeleton sequences with 20 specific subject IDs are used for training and the remaining samples for testing) and cross-view (Xview: skeleton sequences from cameras 2 and 3 are used for training and those from camera 1 for testing).
- NTU RGB+D 120 (NTU-120) [32]: This dataset extends NTU RGB+D 60 with more performers and action classes, expanding to 113,945 skeleton sequences covering 120 daily action classes. It contains skeleton sequences of 106 different performers across a wide range of ages and covers 155 camera views in 32 scenes. Two recommended evaluation metrics are used for this dataset: cross-subject (Xsub: skeleton sequences with 53 specific subject IDs are used for training and the remaining samples for testing) and cross-setup (Xset: skeleton sequences with even setup IDs are used for training and those with odd setup IDs for testing).
- NTU RGB+D 61-120 (NTU-61-120) [32]: This dataset is a subset of NTU RGB+D 120, which contains 57,367 3D skeleton sequences covering the last 60 action classes in NTU RGB+D 120. The action classes in NTU RGB+D 61-120 have no intersection with the ones in NTU RGB+D 60. This dataset serves as an external dataset for evaluating the transfer capability of our DMMG.
V-B Evaluation Protocols
We evaluate our DMMG under four evaluation protocols, including linear, finetune, KNN, and semi-supervised evaluation protocols.
- Linear Evaluation Protocol. This protocol appends a fully connected layer with a Softmax activation function after a frozen pre-trained model and then uses a fully supervised method to train the classifier. We train the model for 80 epochs using an Adam optimizer with an initial learning rate of 0.001, which will be multiplied by 0.1 at epoch 60.
- Finetune Protocol. This protocol appends a linear classifier after a pre-trained model. Different from the linear evaluation protocol, the pre-trained model is trainable. We train the whole model using supervised learning.
- KNN Evaluation Protocol. This protocol utilizes a k-nearest neighbor (KNN) classifier without trainable parameters to evaluate the quality of the action features encoded by the model. We set K = 20, and the temperature parameter is 0.1 in the KNN evaluation protocol.
- Semi-Supervised Evaluation Protocol. In this protocol, we train the whole model under the finetune protocol by only using 1% and 10% randomly sampled labeled data, respectively.
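The KNN evaluation protocol can be illustrated with a small similarity-weighted classifier. Only K = 20 and the temperature of 0.1 come from the text; the use of cosine similarity, the exp(similarity / temperature) vote weighting, and the function names are assumptions of this sketch, which also assumes non-zero feature vectors.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two non-zero feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def knn_predict(query, train_feats, train_labels, k=20, temperature=0.1):
    """Predict a label by similarity-weighted voting over the k nearest
    training features. No parameters are trained."""
    scored = sorted(
        ((cosine(query, feat), label)
         for feat, label in zip(train_feats, train_labels)),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = defaultdict(float)
    for sim, label in scored:
        votes[label] += math.exp(sim / temperature)
    return max(votes, key=votes.get)


# Usage with three stored features and two action labels.
features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
labels = ["wave", "wave", "sit"]
prediction = knn_predict([1.0, 0.05], features, labels, k=3)
```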
V-C Comparison with State-of-the-Art Methods
V-C1 Comparisons with Benchmark CLRs
We conduct our experiments on the NTU-60 and NTU-120 datasets to compare our DMMG with four benchmark contrastive learning-based self-supervised action recognition methods (i.e., SkeletonCLR [16], CrosSCLR [16], AimCLR [15], and SkeleMixCLR [17]). Note that the methods listed in Table I utilize ST-GCN [44] as the backbone, and the DMMG rows without NTU-120 results denote the model pre-trained on NTU-61-120. SkeletonCLR is regarded as the baseline of the contrastive learning-based self-supervised action recognition methods. It augments the skeleton sequences by using Shear and Crop. SkeletonCLR considers the augmented skeleton sample as a positive sample and any other sample as a negative sample [16].
As shown in Table I, our DMMG achieves the best results on various streams, and the DMMG variant pre-trained on NTU-61-120 also achieves promising results. In addition, our DMMG further boosts the model's performance when only the motion stream is used. This is because the motion representations are highly dependent on viewpoint variations. Our viewpoint variation min-max game augments the skeleton data with diverse motion information, and thus learns more representative skeleton motion features. Compared with the three-stream fusion methods 3s-AimCLR [15] and 3s-SkeleMixCLR [17], our two-stream fusion based DMMG (2s-DMMG) achieves higher accuracy by only using joint and motion streams. The experimental results demonstrate the effectiveness of our dual min-max games, which substantially improve the model's performance on the action recognition task.
V-C2 Linear Evaluation Protocol Results
Table II and Table III compare our results with other state-of-the-art methods under the linear evaluation protocol on the NTU-60 and NTU-120 datasets, respectively. Here, we compare our method with approaches that use various deep models as backbones.
From the single-stream perspective, there is a significant gap in accuracy between GCN-based methods (e.g., ISC [53]) and methods built on other backbones (e.g., LSTM-based AS-CAL [51], GRU-based CRRL [52]). Among the GCN-based methods, our DMMG achieves the best results. From the multi-stream perspective, our 2s-DMMG further improves the performance and outperforms many other multi-stream methods, such as 3s-AimCLR [15], 3s-CrosSCLR [16], 3s-SkeletonCLR [16], and 3s-Colorization [33], on both the NTU-60 and NTU-120 datasets. The superior performance of our 2s-DMMG on small-scale and large-scale datasets (NTU-60 and NTU-120) demonstrates the effectiveness and generalization of our dual min-max games.
V-C3 Finetune Protocol Results
Table V compares our method with other benchmark methods under the finetune protocol. Note that ST-GCN is trained with full supervision. As shown in Table V, our 2s-DMMG achieves the best results. Compared to the results under the linear evaluation protocol, the results under the finetune protocol show a significant improvement. Our 2s-DMMG performs slightly better than 3s-SkeleMixCLR under the finetune protocol. The margin is modest because using more data streams contributes more to performance improvement, although it comes with a higher computational cost. Our dual min-max games can boost performance with a relatively lower computational load.
V-C4 KNN Evaluation Protocol Results
As shown in Table VI, our DMMG achieves better results than SkeletonCLR, AimCLR, and SkeleMixCLR on both datasets. Notably, our DMMG achieves a significant improvement in action recognition accuracy on the Xview benchmark of the NTU-60 dataset. The results under the KNN evaluation protocol demonstrate that our viewpoint variation min-max game can guide the model to learn more discriminative feature representations from various viewpoints.
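For readers unfamiliar with KNN evaluation, the sketch below shows the general idea: features from a frozen pre-trained encoder are classified by a vote among their nearest training features, with no additional training. The cosine similarity metric, the default `k=1`, and the function names are assumptions made for illustration; this section does not state the settings actually used.

```python
import math
from collections import Counter
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def knn_predict(train_feats: List[Sequence[float]], train_labels: List[int],
                query: Sequence[float], k: int = 1) -> int:
    """Label a query feature by majority vote among its k most similar
    training features (features are assumed to come from a frozen encoder)."""
    ranked = sorted(
        range(len(train_feats)),
        key=lambda i: cosine(train_feats[i], query),
        reverse=True,
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def knn_accuracy(train_feats, train_labels, test_feats, test_labels,
                 k: int = 1) -> float:
    """Fraction of test features whose predicted label is correct."""
    correct = sum(
        knn_predict(train_feats, train_labels, f, k) == y
        for f, y in zip(test_feats, test_labels)
    )
    return correct / len(test_labels)
```

Because no classifier is trained, accuracy under this kind of protocol reflects how well the learned feature space already separates action classes.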
V-C5 Semi-Supervised Protocol Results
To ensure that each action class has roughly equal representation in the training samples, we randomly sample 1% and 10% of the labeled data from each action class, respectively. As shown in Table IV, our 2s-DMMG outperforms other methods. This indicates that our dual min-max games can make full use of spatio-temporal information to enable the model to perform better on the downstream task.
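The per-class sampling described above can be sketched as follows. This is a minimal illustration, not the authors' script: the function name, the fixed seed, and the rule of keeping at least one sample per class are assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence


def sample_labeled_subset(labels: Sequence[int], fraction: float,
                          seed: int = 0) -> List[int]:
    """Return indices of a labeled subset drawn separately from each class.

    Each class contributes round(fraction * class_size) samples, with at
    least one sample per class (an assumption, so no class is left empty).
    """
    rng = random.Random(seed)
    by_class: Dict[int, List[int]] = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    chosen: List[int] = []
    for y in sorted(by_class):
        pool = by_class[y]
        n = max(1, round(fraction * len(pool)))
        chosen.extend(rng.sample(pool, n))
    return sorted(chosen)
```

Sampling within each class, rather than over the whole dataset, keeps the class proportions of the labeled subset close to those of the full training set even at the 1% setting.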
V-C6 Qualitative Results
We use t-SNE [54] to visualize the action features learned by the pre-trained model at 50, 100, 200, and 300 epochs. For a fair comparison, we randomly select ten action categories for feature visualization. The feature visualization results in Fig 5 show that our DMMG makes the action feature representations of the same category more clustered and distinguishable as training proceeds. The feature representations of the motion stream are discriminative, leading to the promising results in Table I. The qualitative results demonstrate that our DMMG can learn more discriminative features and thus boost the model’s performance on downstream tasks.
V-D Ablation Study
We conduct five ablation studies to verify the effectiveness of different components of our DMMG. Here, we denote the viewpoint variation min-max game as VMMG and the edge perturbation min-max game as EMMG.
V-D1 Effectiveness of the Min-Max Game Strategy
To evaluate the effectiveness of our min-max game strategy, we design two random data augmentation strategies to substitute the two min-max games. Firstly, we randomly change the viewpoints to augment the skeleton sequence. We denote such a random viewpoint variation augmentation strategy as R-View. Secondly, we randomly perturb the connectivity strengths among graph-based body joints to augment graph-based body joints, and denote this random edge perturbation augmentation strategy as R-Edge. Finally, we use D-RR to represent the combination of R-View and R-Edge.
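The two random baselines can be sketched as follows. This is only an illustration under stated assumptions: R-View is modeled as a single random rotation about the vertical (y) axis, and R-Edge as a random down-scaling of existing edge strengths in a symmetric adjacency matrix. The rotation axis, the sampling ranges, and the names `r_view` and `r_edge` are hypothetical and are not taken from the paper.

```python
import math
import random
from typing import List

Joint = List[float]             # (x, y, z)
Adjacency = List[List[float]]   # joints x joints edge strengths


def r_view(sequence: List[List[Joint]], max_angle: float,
           rng: random.Random) -> List[List[Joint]]:
    """Rotate a whole sequence about the y axis by one random angle."""
    theta = rng.uniform(-max_angle, max_angle)
    c, s = math.cos(theta), math.sin(theta)
    return [
        [[c * x + s * z, y, -s * x + c * z] for x, y, z in frame]
        for frame in sequence
    ]


def r_edge(adj: Adjacency, max_scale: float, rng: random.Random) -> Adjacency:
    """Randomly weaken existing edges of a symmetric adjacency matrix.

    Each off-diagonal edge strength is multiplied by a factor drawn from
    [1 - max_scale, 1]; absent edges (zeros) and self-loops are unchanged.
    """
    n = len(adj)
    out = [row[:] for row in adj]
    for i in range(n):
        for j in range(i + 1, n):
            if adj[i][j] != 0.0:
                factor = 1.0 - rng.uniform(0.0, max_scale)
                out[i][j] = out[j][i] = adj[i][j] * factor
    return out
```

Unlike the min-max games, nothing here is learned: the rotation angle and the edge factors are drawn independently of how hard the resulting contrastive pair is for the encoder.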
From Fig 6, we can observe that DMMG outperforms D-RR, and that the min-max game strategies (VMMG and EMMG) perform better than both R-View and R-Edge. This verifies that the adversarial paradigm enables our DMMG to construct more challenging contrastive pairs. In addition, the performance of R-Edge is worse than that of R-View. This is because a purely random edge perturbation strategy may drop connections between key joints, which may prevent the model from capturing critical action information.
V-D2 Effectiveness of Dual Min-Max Games
To evaluate the effectiveness of our two min-max games, we train the model using only one min-max game. For a fair comparison, we adopt the same data streams, training settings, and evaluation protocol as in Table I, and we conduct the experiments on the NTU-60 dataset. As shown in Fig 8, VMMG achieves higher accuracy than EMMG when using the motion stream, while EMMG outperforms VMMG when only using the joint stream. This verifies that our two min-max games improve the model’s performance on different data streams, and that combining the two min-max games can further improve the model’s performance.
V-D3 Transfer Ability
To evaluate the transfer ability of our DMMG, we first pre-train the model on the NTU-61-120 dataset, then transfer the pre-trained model to NTU-60 for linear and finetune evaluation. As shown in Table I, the transferred DMMG achieves better results than many other methods on the NTU-60 dataset. In particular, under the Xview protocol, the DMMG pre-trained on NTU-61-120 has performance comparable to that of the DMMG pre-trained on NTU-60. In Table V, the transferred DMMG achieves better performance than 3s-CrosSCLR [16]. The experimental results demonstrate the strong transfer ability of our DMMG.
V-D4 Comparisons of Different GCN Models
To evaluate the effectiveness of our DMMG on different GCN models, we conduct experiments by using different GCN models on NTU-60 and NTU-120 datasets. Here, we employ 2s-ASGCN [5] as the candidate GCN model. As one of the most robust GCN models, 2s-ASGCN achieves state-of-the-art performance on both NTU-60 and NTU-120 datasets.
As shown in Table VII, our 2s-DMMG with the ASGCN model achieves better results than 2s-DMMG with the ST-GCN model under both the linear and finetune evaluation protocols. This means that our DMMG can achieve better performance on downstream tasks when using a more robust GCN model. Compared with 2s-ASGCN trained in a fully supervised manner, our 2s-DMMG shows a slight improvement under the finetune protocol on NTU-60. This is because 2s-ASGCN can already learn sufficiently discriminative action features and achieve excellent performance on NTU-60. As a result, our pre-trained model can only achieve a limited improvement.
V-D5 Visualization of DMMG
We visualize the original data and augmented data from epochs 50, 100, 200, and 300. As shown in Fig 7, the difference between the augmented data and the original data increases with training epochs, and the viewpoint gradually becomes more challenging in the viewpoint variation min-max game. For EMMG, we can observe that a growing number of connectivity strengths among graph-based body joints decrease as training proceeds.
In particular, from epoch 200 to epoch 300, some key joint connection values are close to zero, indicating that some unnecessary connections are ignored. The visualization of the data augmented by our DMMG demonstrates the effectiveness of the min-max game strategy in producing challenging contrastive pairs. These contrastive pairs can help the model learn more discriminative action feature representations, thus improving its performance on downstream tasks.
VI Conclusion
In this paper, we proposed Dual Min-Max Games (DMMG) based data augmentation for self-supervised skeleton-based action recognition under a contrastive learning framework. Our DMMG consists of a viewpoint variation min-max game and an edge perturbation min-max game, which perform augmentation on the skeleton sequences and graph-structured body joints, respectively. As a result, we obtain a sufficient number of challenging contrastive pairs with our DMMG. These hard contrastive pairs help the model to learn more representative and discriminative feature representations while avoiding overfitting. Thus, our method significantly boosts the performance of our model on downstream tasks. The extensive experimental results demonstrate the effectiveness of our DMMG and show that it achieves state-of-the-art performance under various evaluation protocols.
References
- [1] L. Wang, D. Q. Huynh, and P. Koniusz, “A comparative review of recent kinect-based action recognition algorithms,” IEEE Transactions on Image Processing, vol. 29, pp. 15–28, 2020.
- [2] Z. Cao, T. Simon, S.-E. Wei, and Y. Sheikh, “Realtime multi-person 2d pose estimation using part affinity fields,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 7291–7299.
- [3] Y. Xue, J. Chen, X. Gu, H. Ma, and H. Ma, “Boosting monocular 3d human pose estimation with part aware attention,” IEEE Transactions on Image Processing, vol. 31, pp. 4278–4291, 2022.
- [4] J. Martinez, R. Hossain, J. Romero, and J. J. Little, “A simple yet effective baseline for 3d human pose estimation,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2640–2649.
- [5] L. Shi, Y. Zhang, J. Cheng, and H. Lu, “Two-stream adaptive graph convolutional networks for skeleton-based action recognition,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 12026–12035.
- [6] Y. Du, W. Wang, and L. Wang, “Hierarchical recurrent neural network for skeleton based action recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2015, pp. 1110–1118.
- [7] P. Zhang, C. Lan, J. Xing, W. Zeng, J. Xue, and N. Zheng, “View adaptive recurrent neural networks for high performance human action recognition from skeleton data,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 2117–2126.
- [8] Q. Ke, M. Bennamoun, S. An, F. Sohel, and F. Boussaid, “A new representation of skeleton sequences for 3d action recognition,” in Proceedings of the IEEE conference on computer vision and pattern recognition, 2017, pp. 3288–3297.
