Spatial-temporal Transformer-guided Diffusion based Data Augmentation for Efficient Skeleton-based Action Recognition

Yifan Jiang; Han Chen; Hanseok Ko

arXiv:2302.13434·cs.CV·July 26, 2023


TL;DR

This paper presents a novel data augmentation approach for skeleton-based action recognition using diffusion models guided by a spatial-temporal transformer, significantly improving model performance with synthetic data.

Contribution

It introduces a new diffusion-based data augmentation method guided by a spatial-temporal transformer for generating realistic skeleton action sequences.

Findings

01

Outperforms state-of-the-art motion generation methods

02

Synthetic data improves action recognition accuracy

03

Generates diverse and natural action sequences


Tables

Table 1: Naturality and diversity evaluation on the HumanAct12 dataset. Acc. is action recognition accuracy, O. Div. is overall diversity, and PA. Div. is per-action diversity. (The best evaluation score is marked in bold. $\uparrow$ means a higher number is better, $\downarrow$ indicates a lower number is better, and $\rightarrow$ means the number closer to Real actions is better. $\pm$ indicates 95% confidence interval.)

Methods	FID ( $↓$ )	Acc. ( $↑$ )	O. Div. ( $\to$ )	PA. Div. ( $\to$ )
Real actions	${0.09}^{\pm 0.01}$	${0.96}^{\pm 0.01}$	${6.74}^{\pm 0.03}$	${2.55}^{\pm 0.02}$
CondGRU shlizerman2018audio ; guo2020action2motion	${39.92}^{\pm 0.13}$	${0.06}^{\pm 0.03}$	${2.05}^{\pm 0.05}$	${2.18}^{\pm 0.02}$
Two-stage GAN cai2018deep ; guo2020action2motion	${12.08}^{\pm 0.11}$	${0.45}^{\pm 0.01}$	${5.35}^{\pm 0.06}$	${2.21}^{\pm 0.03}$
Act-MoCoGAN tulyakov2018mocogan ; guo2020action2motion	${5.73}^{\pm 0.18}$	${0.77}^{\pm 0.01}$	${6.84}^{\pm 0.04}$	${1.26}^{\pm 0.02}$
Action2Motion guo2020action2motion	${2.66}^{\pm 0.09}$	${0.91}^{\pm 0.01}$	${6.98}^{\pm 0.03}$	${2.88}^{\pm 0.01}$
ACTOR petrovich2021action	${0.24}^{\pm 0.03}$	${0.93}^{\pm 0.01}$	${6.62}^{\pm 0.05}$	${2.49}^{\pm 0.03}$
Ours	${0.12}^{\pm 0.01}$	${0.95}^{\pm 0.01}$	${6.88}^{\pm 0.02}$	${2.50}^{\pm 0.02}$

Table 2: Replacement data augmentation experiment on the NTU RGB+D 120 dataset. 0%-50% means replacing real training data with synthetic data from the proposed method (or Action2Motion (A2M) and ACTOR) in different proportions. Each cell lists Cross-Subject accuracy followed by Cross-Setup accuracy. (The best evaluation score is marked in bold. $\pm$ indicates 95% confidence interval.)

Methods	0%	10%	20%	30%	40%	50%	50% (A2M)	50% (ACTOR)
MS-G3D liu2020disentangling	${71.4}^{\pm 0.0}$ ${72.0}^{\pm 0.0}$	${70.5}^{\pm 0.2}$ ${71.2}^{\pm 0.2}$	${74.6}^{\pm 0.1}$ ${76.9}^{\pm 0.2}$	${75.4}^{\pm 0.1}$ ${76.9}^{\pm 0.2}$	${73.3}^{\pm 0.2}$ ${73.9}^{\pm 0.2}$	${70.9}^{\pm 0.4}$ ${71.1}^{\pm 0.3}$	${62.7}^{\pm 0.2}$ ${64.2}^{\pm 0.3}$	${66.8}^{\pm 0.3}$ ${67.1}^{\pm 0.3}$
EfficientGCN-B4 song2022constructing	${72.2}^{\pm 0.0}$ ${72.6}^{\pm 0.0}$	${72.1}^{\pm 0.1}$ ${72.3}^{\pm 0.2}$	${75.8}^{\pm 0.3}$ ${77.2}^{\pm 0.1}$	${75.6}^{\pm 0.2}$ ${77.0}^{\pm 0.1}$	${74.1}^{\pm 0.1}$ ${74.8}^{\pm 0.2}$	${72.0}^{\pm 0.3}$ ${73.1}^{\pm 0.2}$	${63.6}^{\pm 0.3}$ ${64.9}^{\pm 0.3}$	${67.0}^{\pm 0.2}$ ${67.8}^{\pm 0.4}$
CTR-GCN chen2021channel	${72.5}^{\pm 0.0}$ ${73.4}^{\pm 0.0}$	${73.2}^{\pm 0.1}$ ${73.9}^{\pm 0.1}$	${76.4}^{\pm 0.2}$ ${77.6}^{\pm 0.1}$	${76.9}^{\pm 0.1}$ ${77.4}^{\pm 0.1}$	${76.1}^{\pm 0.2}$ ${77.7}^{\pm 0.1}$	${72.5}^{\pm 0.3}$ ${73.5}^{\pm 0.3}$	${63.9}^{\pm 0.4}$ ${64.8}^{\pm 0.2}$	${69.3}^{\pm 0.3}$ ${70.0}^{\pm 0.4}$

Table 3: Incremental data augmentation experiment on the NTU RGB+D 120 dataset. 0%-50% means adding extra synthetic data from the proposed method (or Action2Motion (A2M) and ACTOR) to the real training data in different proportions. Each cell lists Cross-Subject accuracy followed by Cross-Setup accuracy. (The best evaluation score is marked in bold. $\pm$ indicates 95% confidence interval.)

Methods	0%	10%	20%	30%	40%	50%	50% (A2M)	50% (ACTOR)
MS-G3D liu2020disentangling	${71.4}^{\pm 0.0}$ ${72.0}^{\pm 0.0}$	${71.6}^{\pm 0.1}$ ${72.3}^{\pm 0.1}$	${72.5}^{\pm 0.1}$ ${73.0}^{\pm 0.2}$	${74.2}^{\pm 0.2}$ ${74.9}^{\pm 0.3}$	${75.6}^{\pm 0.1}$ ${76.0}^{\pm 0.3}$	${75.5}^{\pm 0.1}$ ${76.2}^{\pm 0.1}$	${65.6}^{\pm 0.3}$ ${65.8}^{\pm 0.4}$	${69.9}^{\pm 0.3}$ ${70.4}^{\pm 0.2}$
EfficientGCN-B4 song2022constructing	${72.2}^{\pm 0.0}$ ${72.6}^{\pm 0.0}$	${73.0}^{\pm 0.1}$ ${73.6}^{\pm 0.1}$	${73.3}^{\pm 0.1}$ ${73.5}^{\pm 0.1}$	${76.8}^{\pm 0.2}$ ${77.9}^{\pm 0.2}$	${76.7}^{\pm 0.3}$ ${77.3}^{\pm 0.2}$	${76.5}^{\pm 0.1}$ ${77.0}^{\pm 0.3}$	${67.0}^{\pm 0.3}$ ${67.8}^{\pm 0.4}$	${71.8}^{\pm 0.2}$ ${72.5}^{\pm 0.4}$
CTR-GCN chen2021channel	${72.5}^{\pm 0.0}$ ${73.4}^{\pm 0.0}$	${73.0}^{\pm 0.2}$ ${73.7}^{\pm 0.1}$	${74.6}^{\pm 0.1}$ ${75.3}^{\pm 0.1}$	${77.4}^{\pm 0.3}$ ${78.9}^{\pm 0.2}$	${77.6}^{\pm 0.4}$ ${79.4}^{\pm 0.2}$	${77.4}^{\pm 0.2}$ ${79.1}^{\pm 0.3}$	${67.9}^{\pm 0.3}$ ${69.0}^{\pm 0.3}$	${72.2}^{\pm 0.4}$ ${72.6}^{\pm 0.2}$
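The two protocols in Tables 2 and 3 differ only in how the training set is assembled. The sketch below illustrates one plausible reading of the captions; the function names, the uniform random selection, and the rounding are assumptions, and class balance is ignored for brevity.

```python
import random

def replacement_mix(real, synthetic, proportion, rng):
    """Swap a fraction of the real samples for synthetic ones; set size is unchanged."""
    k = round(len(real) * proportion)
    kept = rng.sample(real, len(real) - k)
    return kept + rng.sample(synthetic, k)

def incremental_mix(real, synthetic, proportion, rng):
    """Keep every real sample and add synthetic ones worth a fraction of the real set."""
    k = round(len(real) * proportion)
    return list(real) + rng.sample(synthetic, k)

# Hypothetical placeholder samples; real data would be skeleton sequences with labels.
rng = random.Random(0)
real = [("real", i) for i in range(100)]
synthetic = [("syn", i) for i in range(100)]

replaced = replacement_mix(real, synthetic, 0.3, rng)   # 70 real + 30 synthetic
added = incremental_mix(real, synthetic, 0.3, rng)      # 100 real + 30 synthetic
```

Under this reading, the replacement setting keeps the training set size fixed while the incremental setting grows it, which is why the 0% column is identical in both tables.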

Table 4: Ablation studies of different guiding models and hyper-parameters. 50% replacement experiment settings are used. CS: Cross-Subject accuracy, CP: Cross-Setup accuracy. ($\pm$ indicates 95% confidence interval.)

Method	FID ( $↓$ )	CS ( $↑$ )	CP ( $↑$ )
Ours (ST-Trans + 1000 steps + Cosine)	${0.39}^{\pm 0.01}$	${72.5}^{\pm 0.3}$	${73.5}^{\pm 0.3}$
w/o Guidance	${39.83}^{\pm 0.26}$	-	-
CNNs tan2019efficientnet	${3.49}^{\pm 0.02}$	${61.8}^{\pm 0.5}$	${63.4}^{\pm 0.3}$
Unet dhariwal2021diffusion	${1.66}^{\pm 0.03}$	${70.7}^{\pm 0.3}$	${71.2}^{\pm 0.4}$
CLIP radford2021learning	${2.92}^{\pm 0.04}$	${68.2}^{\pm 0.3}$	${68.8}^{\pm 0.3}$
100 steps	${12.50}^{\pm 0.13}$	${66.4}^{\pm 0.2}$	${66.9}^{\pm 0.1}$
500 steps	${1.46}^{\pm 0.01}$	${69.2}^{\pm 0.2}$	${70.1}^{\pm 0.3}$
2500 steps	${0.41}^{\pm 0.01}$	${72.5}^{\pm 0.4}$	${73.3}^{\pm 0.3}$
Linear	${0.40}^{\pm 0.01}$	${72.6}^{\pm 0.5}$	${72.9}^{\pm 0.4}$
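For readers unfamiliar with the "Cosine" and "Linear" rows, the sketch below shows the cosine noise schedule in the form given by Nichol and Dhariwal (nichol2021improved). The offset `s = 0.008` and the clipping value `0.999` are taken from that formulation; whether this paper uses exactly these constants is an assumption.

```python
import math

def cosine_alpha_bars(num_steps, s=0.008):
    """alpha_bar_t = f(t) / f(0), with f(t) = cos^2(((t / T) + s) / (1 + s) * pi / 2)."""
    def f(t):
        return math.cos((t / num_steps + s) / (1.0 + s) * math.pi / 2.0) ** 2
    return [f(t) / f(0) for t in range(num_steps + 1)]

def betas_from_alpha_bars(abar, max_beta=0.999):
    """beta_t = 1 - alpha_bar_t / alpha_bar_{t-1}, clipped to avoid beta_t = 1."""
    return [min(1.0 - abar[t] / abar[t - 1], max_beta) for t in range(1, len(abar))]

abar = cosine_alpha_bars(1000)       # index 0 is the clean input, index T is pure noise
betas = betas_from_alpha_bars(abar)  # one variance per diffusion step
```

Compared with linearly spaced variances, this schedule destroys information more slowly at the start of the forward process, which is the usual motivation for preferring it on small images.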

Table 5: Ablation studies of the effectiveness of various guiding strategies. ($\pm$ indicates 95% confidence interval.)

Method	Model size ( $↓$ )	Convergence iterations ( $↓$ )
Ours (ST-Trans + Clean data)	196M	$463 K^{\pm 2.6 K}$
Unet dhariwal2021diffusion + Noisy data	223M	$533 K^{\pm 2.0 K}$
CNNs tan2019efficientnet + Clean data	219M	$480 K^{\pm 3.1 K}$
ViT dosovitskiy2020image + Clean data	240M	$561 K^{\pm 4.9 K}$
CLIP radford2021learning + Clean data	261M	$249 K^{\pm 3.3 K}$

Equations

$$q(x_{t}\mid x_{t-1})=\mathcal{N}(x_{t};\sqrt{1-\beta_{t}}\,x_{t-1},\beta_{t}\textbf{I})$$

$$q(x_{1:T}\mid x_{0})=\prod_{t=1}^{T}q(x_{t}\mid x_{t-1})$$

$$q(x_{t}\mid x_{0})=\mathcal{N}(x_{t};\sqrt{\bar{\alpha}_{t}}\,x_{0},(1-\bar{\alpha}_{t})\textbf{I})$$

$$x_{t}=\sqrt{\bar{\alpha}_{t}}\,x_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon$$

$$p_{\theta}(x_{t-1}\mid x_{t})=\mathcal{N}(x_{t-1};\mu_{\theta}(x_{t},t),\sigma_{t}^{2}\textbf{I})$$

$$p_{\theta}(x_{0:T})=p(x_{T})\prod_{t=1}^{T}p_{\theta}(x_{t-1}\mid x_{t})$$

$$\mu_{\theta}(x_{t},t)=\frac{1}{\sqrt{1-\beta_{t}}}\left(x_{t}-\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\,\epsilon_{\theta}(x_{t},t)\right)$$

$$\hat{x}_{0}=\frac{1}{\sqrt{\bar{\alpha}_{t}}}\left[x_{t}-\sqrt{1-\bar{\alpha}_{t}}\,\epsilon_{\theta}(x_{t},t)\right]$$


Full text

Yifan Jiang, Han Chen, Hanseok Ko (corresponding author)

School of Electrical Engineering, Korea University, Seoul 02841, South Korea


Abstract

Recently, skeleton-based human action recognition has become a hot research topic because the compact representation of human skeletons has revitalized this research domain. As a result, researchers have begun to notice the importance of analyzing human action by extracting skeleton information from RGB or other sensors. Leveraging the rapid development of deep learning (DL), a significant number of skeleton-based human action approaches with carefully designed DL structures have been presented recently. However, a well-trained DL model demands sufficient high-quality data, which is hard to obtain without high expense and human labor. In this paper, we introduce a novel data augmentation method for skeleton-based action recognition tasks, which can effectively generate high-quality and diverse sequential actions. To obtain natural and realistic action sequences, we use denoising diffusion probabilistic models (DDPMs) to generate synthetic action sequences, with the generation process precisely guided by a spatial-temporal transformer (ST-Trans). Experimental results show that our method outperforms the state-of-the-art (SOTA) motion generation approaches on different naturality and diversity metrics. They also show that its high-quality synthetic data can be effectively deployed to existing action recognition models with significant performance improvement.

Keywords:

Denoising Diffusion Probabilistic Models · Skeleton-based Action Recognition · Data Augmentation · Image Synthesis

1 Introduction

Human action recognition is crucial in various video-based visual applications, for instance, video surveillance, video understanding, and human-computer interaction (HCI) kong2022human ; liu2021no ; lou2019ar . Different sensors have been considered, such as RGB frames fayyaz20213d ; duan2020omni ; li2021ct , depth maps wang2017structured ; wang2018depth ; sanchez20223dfcnn , thermal images imran2020evaluating ; mehta2021motion , and human skeleton yan2018spatial ; shi2019two ; liu2020disentangling ; cheng2020skeleton ; song2022constructing ; li2021symbiotic ; li20213d . Among these modalities, the human skeleton is currently gaining growing attention because of its high compactness and robustness. In practice, the representation of the human skeleton is usually mapped by a time series of 3D coordinate sequences, which can be extracted by pose estimation approaches. Therefore, only the pure skeleton information is included, and it is naturally much more robust to the variation of camera angle, illumination, and background.

While the results of existing works are encouraging, it is still not easy to train a well-performing skeleton-based action recognition model because of data scarcity. In practice, data scarcity is expected given the high expense of motion capture and labeling, and it negatively impacts action recognition performance. Therefore, discussing data augmentation for skeleton-based action recognition is meaningful and imperative.

Over the past few years, generative models, represented by generative adversarial networks (GANs) goodfellow2014generative , have shown their superiority in different visual tasks, for example, photo-realistic image synthesis park2019semantic ; zhu2020sean , text-to-image generation hinz2020semantic ; zhu2019dm , medical image analysis chen2022unsupervised ; jiang2020covid , image enhancement kim2020unsupervised ; park2019adaptive and image manipulation kim2022style ; couairon2022flexit ; kwak2020cafe . Nevertheless, GANs’ synthetic data suffer from the lack of diversity nichol2021improved and low stability when conducting the training process with complex hyper-parameter settings brock2016neural ; brock2018large . More recently, researchers found that the diffusion models ho2020denoising can generate realistic images with high quality. Compared to GANs-based image synthesis approaches, diffusion models are a series of likelihood-based architectures with many advantages: a steady training scheme, better flexibility, and domain adaptive capability nichol2021improved ; dhariwal2021diffusion ; nichol2021glide . Although the above diffusion techniques have recently emerged with encouraging results, generating data containing spatial-temporal information like skeleton sequence leaves much to be desired. Besides, the existing conditional diffusion approach dhariwal2021diffusion suffers from low effectiveness because its classifier is pretrained with noisy images using a complex training strategy to enable conditional guidance.

This paper proposes a novel action synthesis algorithm designed to improve skeleton-based action recognition performance when data are scarce. Specifically, we introduce a conditional generative model that (1) can generate natural action sequences with enough spatial-temporal information, rather than awkward or repeated ones, (2) can be conditioned on specific action categories so that the generation process is controllable, (3) is not constrained to a specific action domain, for instance, actions with sitting or standing poses, and (4) does not rely on noisy training data, in other words, can be trained more efficiently and stably.

To achieve the above goals, we design a transformer-guided diffusion model, which consists of two main modules: a visual transformer (ViT) module mehta2021mobilevit and a denoising diffusion probabilistic model (DDPM) module ho2020denoising ; nichol2021improved . Specifically, the pretrained DDPM module cooperates with the pretrained ViT module through a guiding strategy so that the DDPM can sample action sequences under the guidance of the ViT. Therefore, it can generate a series of augmented action sequences with only an action label provided. To sum up, our contributions are as follows:

(1)

We propose a transformer-guided diffusion approach for improving skeleton-based action recognition performance. It is specially designed and optimized for handling the data scarcity of field-captured action sequences.

(2)

A spatial-temporal transformer is proposed to learn joint position relations on both spatial and temporal levels and precisely guide the diffusion process towards specific action labels.

(3)

We present a novel guiding strategy that enables conditional guidance from the visual transformer and eliminates the dependency on noisy latents. According to our experiments, this strategy is more effective and practical than existing SOTA methods and contributes significantly to diffusion performance.

2 Related works

Denoising diffusion probabilistic models. Denoising diffusion probabilistic models (DDPMs) consist of a forward diffusion process that gradually adds noise to inputs and a reverse denoising process that learns to recover data by removing noise. DDPMs have recently been shown to generate high-quality synthetic data, especially images. Many efforts have followed the introduction of DDPMs ho2020denoising . Given the limitations of the original DDPMs, some research focuses on refining the architecture and optimizing the sampling strategy. Denoising diffusion implicit models (DDIM) song2020denoising accelerate the sampling process by constructing a series of non-Markovian diffusion processes rather than simulating a Markov chain. Later, critically-damped Langevin diffusion (CLD) dockhorn2021score was introduced, building on the success of existing score-based generative models. More recently, nichol2021improved improved DDPMs by optimizing the variational lower bound, allowing them to achieve better log-likelihoods. Other work emphasizes the conditional generation of high-resolution images. For scalar conditioning, Dhariwal et al. dhariwal2021diffusion proposed a U-Net ronneberger2015u structure that can be integrated into DDPMs to condition the generation process. To make DDPMs more controllable, CLIP-based diffusion models radford2021learning ; nichol2021glide ; ramesh2022hierarchical were introduced to leverage strong visual-language cross-domain representations. Although DDPMs have achieved remarkable results in many domains, they are still rarely used for data augmentation in skeleton-based action recognition.

Skeleton-based action recognition. With the rapid evolution of pose estimation methods, skeleton-based action recognition approaches have been boosted by the high-quality skeleton data these methods provide. There are three mainstream families of skeleton-based methods: recurrent neural network (RNN) based methods liu2016spatio ; wang2017modeling ; lee2017ensemble ; shi2019skeleton ; li2021memory , convolutional neural network (CNN) based methods li2017skeleton ; li2018co ; caetano2019skelemotion ; li2019learning and GCN-based methods yan2018spatial ; li2019actional ; shi2019two ; liu2020disentangling ; song2022constructing ; chen2021channel . RNN-based methods mainly use the RNN structure as a long-term temporal learner that obtains long-range temporal information from input videos. Ref wang2017modeling uses a Siamese structure that takes spatial and temporal information at the same time. Liu et al. liu2016spatio tried to learn the relationship from one dataset to the other. More recently, li2021memory combined the attention mechanism with the RNN model and designed a special temporal attention module that captures attention information from the temporal domain of input skeleton sequences. Among CNN-based methods, ref li2017skeleton introduced a new encoding strategy that maps skeleton data into images. Ref caetano2019skelemotion also focused on the encoding method and considered joint motion and temporal information from video together. An end-to-end method li2018co was used to exploit feature representations at different levels. GCN-based methods are the most popular stream in the action recognition domain. ST-GCN yan2018spatial began to use a graph to represent the spatial and temporal information of skeleton joints. The potential of GCNs is not limited to predicting the current action: ref li2019actional ; shi2019two made it possible to predict the next action from current skeleton inputs. More recently, MS-G3D liu2020disentangling introduced a multi-scale aggregation scheme to disentangle the significance of neighboring joints for better long-range modeling. EfficientGCN song2022constructing is a GCN-based method that aims to build a faster and more effective action recognition model by refining network designs. CTR-GCN chen2021channel achieved remarkable performance on several popular action recognition datasets by leveraging specially designed channel-wise modeling. Despite the encouraging results achieved by the existing skeleton-based methods, few of them consider data scarcity, which is very common in practice.

Conditioned human motion generation. Although generating arbitrary human action is relatively easy and straightforward ormoneit2005representing ; urtasun2007modeling , its sub-task, action-conditioned human motion generation, is much harder and has received less attention. Some works have considered translating different modalities (e.g. text, audio waves, action labels) into human motions. For text-to-motion tasks, pioneering efforts have built on RNNs and advanced language models: Text2Action ahn2018text2action and DVGANs lin2018human use textual information to generate corresponding motions. As for the audio-to-motion task, ref takeuchi2017speech proposed a long short-term memory (LSTM) model to translate audio waves into 3D human gestures. More recently, some efforts related to dance generation have been made lee2019dancing ; li2020learning . These approaches mainly take music audio as input to condition dance motion generation. Action-conditioned human motion generation is closer to our topic. Action2Motion guo2020action2motion is a variational auto-encoder (VAE) based model designed for generating diverse human actions, and ACTOR petrovich2021action presented a novel transformer-based VAE to solve the variable-length motion generation problem. Although the proposed method is also an action-conditioned approach, we tackle the motion generation problem in another way: leveraging modern image synthesis techniques to synthesize realistic and natural human motion.

3 DDPMs preliminaries

In this section, we review the structure and formulation of Denoising Diffusion Probabilistic Models (DDPMs), following the notation of ref nichol2021improved . The DDPM workflow is shown in Figure 2 and can be separated into two processes. A forward diffusion process gradually adds noise to an input $x_{0}$ over timesteps $t$ until it becomes an isotropic Gaussian noise sample $x_{T}$ . A reverse denoising process then samples intermediate latents $x_{T-1},x_{T-2},...$ from $x_{T}$ step by step towards a clean sample $x_{0}$ .

During the forward process, we start from an input $x_{0}$ drawn from the data distribution $q(x_{0})$ . Then, we derive a series of intermediate latents $x_{1},...,x_{T}$ by adding Gaussian noise with variance $\beta_{t}$ at each timestep $t$ :

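$$q(x_{t}\mid x_{t-1})=\mathcal{N}(x_{t};\sqrt{1-\beta_{t}}\,x_{t-1},\beta_{t}\textbf{I}),\qquad q(x_{1:T}\mid x_{0})=\prod_{t=1}^{T}q(x_{t}\mid x_{t-1})$$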

To sample an intermediate latent $x_{t}$ faster, we define $\bar{\alpha}_{t}=\prod_{s=1}^{t}(1-\beta_{s})$ , which gives a diffusion kernel that samples $x_{t}$ directly from $x_{0}$ as follows:

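$$q(x_{t}\mid x_{0})=\mathcal{N}(x_{t};\sqrt{\bar{\alpha}_{t}}\,x_{0},(1-\bar{\alpha}_{t})\textbf{I}),\qquad x_{t}=\sqrt{\bar{\alpha}_{t}}\,x_{0}+\sqrt{1-\bar{\alpha}_{t}}\,\epsilon$$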

where $\epsilon\sim\mathcal{N}(0,\textbf{I})$ .
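The closed-form forward sampling can be sketched in a few lines of plain Python. This is an illustration only: the linear variance range (`1e-4` to `0.02`), the toy three-element input, and all function names are assumptions, not values or code from the paper.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linearly spaced variances beta_1..beta_T (range is illustrative)."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def alpha_bars(betas):
    """Cumulative products: alpha_bar_t = prod_{s<=t} (1 - beta_s)."""
    out, running = [], 1.0
    for beta in betas:
        running *= 1.0 - beta
        out.append(running)
    return out

def q_sample(x0, t, abar, rng):
    """Draw x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps; t is 1-indexed."""
    a = abar[t - 1]
    eps = [rng.gauss(0.0, 1.0) for _ in x0]
    xt = [math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(x0, eps)]
    return xt, eps

rng = random.Random(0)
abar = alpha_bars(linear_betas(1000))
x0 = [0.5, -0.25, 1.0]                  # stand-in for flattened skeleton-image pixels
xt, eps = q_sample(x0, 500, abar, rng)  # jump straight to timestep 500
```

Because $\bar{\alpha}_{t}$ shrinks towards zero as $t$ grows, the signal term vanishes and $x_{T}$ is close to pure Gaussian noise, which is what allows the reverse process to start from $\mathcal{N}(0,\textbf{I})$.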

To sample a new data point from the distribution $q(x_{0})$ , we follow the reverse denoising process, which starts by sampling from the distribution $p(x_{T})=\mathcal{N}(x_{T};0,\textbf{I})$ and then repeatedly samples from the posteriors $q(x_{t-1}|x_{t})$ . Since $q(x_{t-1}|x_{t})$ is intractable, we opt for a learnable variational autoencoder (VAE) $p_{\theta}$ to approximate the posteriors by predicting the mean $\mu_{\theta}(x_{t},t)$ and variance $\sigma_{t}^{2}$ of $x_{t-1}$ given input $x_{t}$ . Therefore, the intermediate latents $x_{t-1}$ and a new data point $x_{0}$ can be sampled from the following distributions:

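$$p_{\theta}(x_{t-1}\mid x_{t})=\mathcal{N}(x_{t-1};\mu_{\theta}(x_{t},t),\sigma_{t}^{2}\textbf{I}),\qquad p_{\theta}(x_{0:T})=p(x_{T})\prod_{t=1}^{T}p_{\theta}(x_{t-1}\mid x_{t})$$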

In practice, the strategy of obtaining the mean and variance is tricky. As for the mean $\mu_{\theta}(x_{t},t)$ , Ho et al. ho2020denoising introduced a noise-prediction network to obtain the mean by predicting the noise $\epsilon_{\theta}(x_{t},t)$ . So we can update Equation 2 as below:

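$$\mu_{\theta}(x_{t},t)=\frac{1}{\sqrt{1-\beta_{t}}}\left(x_{t}-\frac{\beta_{t}}{\sqrt{1-\bar{\alpha}_{t}}}\,\epsilon_{\theta}(x_{t},t)\right)$$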

In terms of variance $\sigma_{t}^{2}$ , Ho et al. ho2020denoising kept $\sigma_{t}^{2}=\beta_{t}$ . However, recent works nichol2021improved ; bao2022analytic found that learning a parameterized $\sigma_{t}^{2}$ by minimizing the variational bound leads to faster convergence and a more stable training process.
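As a sanity check on the noise-prediction form of the mean, the sketch below plugs the true noise into $\mu_{\theta}$ and compares the result with the mean of the forward posterior $q(x_{t-1}|x_{t},x_{0})$ given by Ho et al. ho2020denoising . The numbers are toy values chosen for illustration, and the function names are hypothetical.

```python
import math

def reverse_mean(xt, eps_pred, beta_t, abar_t):
    """mu_theta(x_t, t) = (x_t - beta_t / sqrt(1 - abar_t) * eps) / sqrt(1 - beta_t)."""
    scale = 1.0 / math.sqrt(1.0 - beta_t)
    coef = beta_t / math.sqrt(1.0 - abar_t)
    return [scale * (x - coef * e) for x, e in zip(xt, eps_pred)]

def posterior_mean(x0, xt, beta_t, abar_t, abar_prev):
    """Mean of the true posterior q(x_{t-1} | x_t, x_0), used only as a reference."""
    c0 = math.sqrt(abar_prev) * beta_t / (1.0 - abar_t)
    ct = math.sqrt(1.0 - beta_t) * (1.0 - abar_prev) / (1.0 - abar_t)
    return [c0 * a + ct * b for a, b in zip(x0, xt)]

# Toy numbers (assumed): abar_t must equal abar_prev * (1 - beta_t).
beta_t, abar_prev = 0.01, 0.6
abar_t = abar_prev * (1.0 - beta_t)
x0, eps = [0.5, -1.0], [0.3, 0.7]
xt = [math.sqrt(abar_t) * a + math.sqrt(1.0 - abar_t) * e for a, e in zip(x0, eps)]

mu = reverse_mean(xt, eps, beta_t, abar_t)            # uses only x_t and the noise
mu_ref = posterior_mean(x0, xt, beta_t, abar_t, abar_prev)  # uses x_0 directly
```

The two means coincide when the predicted noise equals the true noise, which is why training a network to predict $\epsilon$ is enough to recover the reverse-process mean.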

4 Spatial-temporal Transformer-guided Diffusion-based Data Augmentation for Efficient Skeleton-based Action Recognition

We introduce the proposed method in detail in this section, and Figure 3 gives an overview. Given a class condition $y$ , we aim to generate natural action sequences using a pretrained diffusion model $\theta$ under the guidance of a pretrained spatial-temporal transformer $\phi$ . We start from a pretrained diffusion model with two inputs: a class condition $y$ and a noise map $x_{T}$ sampled from the Gaussian distribution $\mathcal{N}(0,\textbf{I})$ . Next, a series of intermediate latents $x_{T-1},...,x_{1}$ is sampled step by step. At each sampling step, a clean estimate $\hat{x}_{0}$ of the noisy latent $x_{t}$ is computed first, and the process is then guided by the pretrained transformer $p_{\phi}(y|\hat{x}_{0})$ through its gradient $\nabla_{\hat{x}_{0}}p_{\phi}$ . This guiding mechanism leads the sampling process gradually toward the class condition $y$ . The final output is a synthetic skeleton image, which is then translated to 3D joint coordinates that action recognition methods can use directly.

4.1 Transformer-guided Diffusion

Dhariwal et al. dhariwal2021diffusion proposed a conditional diffusion method that leverages a classifier pretrained on noisy images to guide the sampling process toward a class condition. However, because the classifier must learn the representation of noisy data, its design and training process are complex and inefficient. To tackle this issue, we propose a novel guiding strategy, which directly estimates a clean image $\hat{x}_{0}$ from the intermediate latent $x_{t}$ and then uses it to obtain the gradient of the transformer. Recalling the DDPM preliminaries, at each timestep we can estimate the noise $\epsilon_{\theta}(x_{t},t)$ that was added to $x_{0}$ to acquire $x_{t}$ . Naturally, the clean image $\hat{x}_{0}$ can be derived from $\epsilon_{\theta}(x_{t},t)$ through Equation 2:

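$$\hat{x}_{0}=\frac{1}{\sqrt{\bar{\alpha}_{t}}}\left[x_{t}-\sqrt{1-\bar{\alpha}_{t}}\,\epsilon_{\theta}(x_{t},t)\right]$$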

Finally, we define a loss to evaluate the similarity between synthetic and real action sequences. Specifically, we propose to use a simple but effective cross-entropy loss $\mathcal{L}_{C}(\hat{x}_{0},y)$ to evaluate the difference between synthetic and target data distributions. To clearly depict the whole transformer-guided diffusion process, we summarize the proposed method in Algorithm 1.
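One guided sampling step can be sketched as follows. Algorithm 1 is not reproduced in this text, so the update rule (shifting the reverse-process mean against the loss gradient, scaled by the variance) is an assumption modeled on classifier guidance, and the linear softmax classifier is a hypothetical stand-in for the spatial-temporal transformer.

```python
import math

def estimate_x0(xt, eps_pred, abar_t):
    """x0_hat = (x_t - sqrt(1 - abar_t) * eps_pred) / sqrt(abar_t)."""
    return [(x - math.sqrt(1.0 - abar_t) * e) / math.sqrt(abar_t)
            for x, e in zip(xt, eps_pred)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [v / total for v in exps]

def cross_entropy_and_grad(weights, x, y):
    """Loss -log p(y | x) of a linear softmax classifier and its gradient w.r.t. x."""
    logits = [sum(w * v for w, v in zip(row, x)) for row in weights]
    probs = softmax(logits)
    loss = -math.log(probs[y])
    # d loss / d logits = probs - onehot(y); chain rule through logits = W x.
    dlogits = [p - (1.0 if k == y else 0.0) for k, p in enumerate(probs)]
    grad = [sum(dlogits[k] * weights[k][j] for k in range(len(weights)))
            for j in range(len(x))]
    return loss, grad

def guided_mean(mu, grad, variance, scale):
    """Shift the reverse-process mean against the loss gradient (assumed rule)."""
    return [m - scale * variance * g for m, g in zip(mu, grad)]

# Hypothetical toy setup: 2 classes, 3-dimensional "image".
W = [[1.0, 0.0, -1.0], [-1.0, 0.5, 1.0]]
beta_t, abar_t, y = 0.02, 0.5, 0
xt, eps_pred = [0.2, -0.4, 0.9], [0.1, 0.0, -0.3]

x0_hat = estimate_x0(xt, eps_pred, abar_t)             # clean estimate, no noisy classifier
loss, grad = cross_entropy_and_grad(W, x0_hat, y)      # L_C(x0_hat, y) and its gradient
mu = [(x - beta_t / math.sqrt(1.0 - abar_t) * e) / math.sqrt(1.0 - beta_t)
      for x, e in zip(xt, eps_pred)]
mu_guided = guided_mean(mu, grad, variance=beta_t, scale=1.0)

# A small step along the negative gradient lowers the loss for the target class.
stepped = [v - 0.1 * g for v, g in zip(x0_hat, grad)]
loss_after, _ = cross_entropy_and_grad(W, stepped, y)
```

The point of the design is visible in the first call: the classifier only ever sees the clean estimate $\hat{x}_{0}$ , so it can be trained on clean data alone.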

4.2 Spatial-temporal transformer

In this subsection, we discuss the proposed spatial-temporal transformer in detail. The skeleton image representation is tiny and contains rich spatial-temporal information (joint position changes over time). It is therefore natural to opt for a visual transformer (ViT) for these data, because its attention mechanism can effectively learn relations among patches of the original image.

In this work, we use MobileViT mehta2021mobilevit as our backbone and adjust the network design to our task. Since an action sequence is represented as a tiny size image, we refine the network structure to adjust to the input and modify the internal MobileNetv2 and MobileViT blocks. This improvement allows us to train the proposed spatial-temporal transformer faster and improve the guidance performance for small images.

Experimental results suggest that, by leveraging its inherent attention mechanism, the proposed spatial-temporal transformer outperforms other state-of-the-art CNN-based classifiers on different metrics when guiding a diffusion process. Additionally, its lightweight and compact design enables it to perform better than the other classifiers on the tiny skeleton image representation.

5 Experiments

5.1 Dataset and experimental settings

5.1.1 Dataset

NTU RGB+D shahroudy2016ntu ; liu2020ntu . The NTU RGB+D dataset is a large-scale video dataset for the action recognition task. It contains a total of 120 different action categories and 114,480 video clips ranging from daily actions to two-person interactions. To evaluate data augmentation performance for the action recognition task, we evaluate the proposed method on the full-scale NTU RGB+D dataset. To stay comparable with the originally reported performance of other SOTA action recognition methods, we use the original pose annotations, which are captured by Kinect. Two benchmarks are suggested: 1) the cross-subject benchmark separates video clips into a training set (63,026) and an evaluation set (50,922) by subject; 2) the cross-setup benchmark separates video clips into a training set (54,471) and an evaluation set (59,477) by scenario setup. We take 20% of the training set at the subject or setup level to pretrain the diffusion and ST-Trans models and leave the rest of the training set for training the action recognition approaches in the quantitative experiment. Note that we remove video clips from the original dataset that contain mutual actions, are too long or too short, or are too noisy.

HumanAct12. Similar to NTU RGB+D, we follow the experimental settings of two other SOTA methods guo2020action2motion ; petrovich2021action and use the HumanAct12 dataset to evaluate the naturality and diversity of the proposed method. The HumanAct12 dataset is an adjusted version of the PHSPD dataset zou20203d ; zou2020polarization . All 1,191 video clips are reorganized into 12 action categories, and the corresponding SMPL parameters are also provided.

5.1.2 Evaluation metrics

To compare the proposed method with other SOTA action-conditioned motion generation methods and to assess its data augmentation performance properly, we use two kinds of evaluation metrics in this paper:

5.1.3 Evaluation metrics for naturality and diversity

We follow guo2020action2motion and petrovich2021action to measure the naturality and diversity of synthetic data. Four metrics are considered in the naturality and diversity experiments: Frechet Inception Distance (FID), action recognition accuracy, overall diversity, and per-action diversity. To be specific, FID heusel2017gans is a prevalent metric to evaluate the similarity between synthetic and real data. A lower FID score indicates that the synthetic data is closer to the real data. Additionally, we apply the same pretrained RNN-based action recognition model as in guo2020action2motion and petrovich2021action to a set of synthetic action sequences and report the action recognition accuracy. A higher accuracy indicates that the synthetic data distribution is more similar to the real one. As for the overall diversity metric, we extract features from a set of synthetic and real data using the above pretrained RNN model, then compute the L2 distance between each synthetic-real feature pair. Finally, the per-action diversity is computed from the same L2 distances between synthetic-real feature pairs, but at the class level.
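The two diversity metrics can be illustrated with a small sketch that follows the pairing described above. How pairs are sampled and averaged in the cited works is not specified here, so averaging over all pairs (and over classes for the per-action variant) is an assumption, and the two-dimensional features are toy stand-ins for RNN features.

```python
import math

def l2(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def overall_diversity(synthetic, real):
    """Mean L2 distance over all synthetic-real feature pairs."""
    dists = [l2(s, r) for s in synthetic for r in real]
    return sum(dists) / len(dists)

def per_action_diversity(synthetic, syn_labels, real, real_labels):
    """Same distance, restricted to pairs sharing a label, averaged over classes."""
    per_class = []
    for label in sorted(set(syn_labels) & set(real_labels)):
        s = [f for f, lab in zip(synthetic, syn_labels) if lab == label]
        r = [f for f, lab in zip(real, real_labels) if lab == label]
        per_class.append(overall_diversity(s, r))
    return sum(per_class) / len(per_class)

# Toy features (assumed); real features would come from the pretrained RNN.
syn = [[0.0, 0.0], [3.0, 4.0]]
real = [[0.0, 0.0], [6.0, 8.0]]
o_div = overall_diversity(syn, real)
pa_div = per_action_diversity(syn, [0, 1], real, [0, 1])
```

Both values are meant to be compared against the same statistic computed on real actions, which is why Table 1 marks them with $\rightarrow$ rather than with an up or down arrow.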

5.1.4 Evaluation metrics of data augmentation task for action recognition

Most SOTA action recognition methods report their cross-subject and cross-setup accuracy on NTU RGB+D 120 dataset. In order to make our results comparable, we follow these two metrics when evaluating the data augmentation performance for the action recognition task.

5.2 Implementation details

5.2.1 Skeleton image representation

Since image synthesis in this work is only an intermediate stage, the image representation must be convertible back to joint coordinates without loss. We follow Du et al. du2015skeleton to encode the action sequence into a matrix of size $J\times T\times 3$ , where $J$ is the number of joints, $T$ is the length of the corresponding action sequence, and 3 corresponds to the 3D coordinates of each joint. To avoid resizing operations that may change the pixel values of the skeleton image representation, we keep $J=T$ to obtain a square skeleton image representation. In practice, this image representation is centrally interpolated with zero padding into a $32\times 32$ image as the model input.
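A minimal sketch of such a lossless round trip is given below. It assumes that the $J\times T\times 3$ matrix is simply placed at the center of a zero-filled $32\times 32\times 3$ canvas, which is one possible reading of the padding step above; the function names and the `meta` tuple are hypothetical and not taken from the paper.

```python
def to_skeleton_image(seq, size=32):
    """Encode seq[t][j] = (x, y, z) as a size x size x 3 nested list.

    Rows index joints and columns index frames; the J x T block is centered
    on a zero canvas (an assumed interpretation of the padding step).
    """
    T, J = len(seq), len(seq[0])
    if J > size or T > size:
        raise ValueError("sequence does not fit into the target image")
    top, left = (size - J) // 2, (size - T) // 2
    img = [[[0.0, 0.0, 0.0] for _ in range(size)] for _ in range(size)]
    for t in range(T):
        for j in range(J):
            img[top + j][left + t] = list(seq[t][j])
    return img, (top, left, J, T)


def from_skeleton_image(img, meta):
    """Crop the centered block and restore seq[t][j] = (x, y, z)."""
    top, left, J, T = meta
    return [[tuple(img[top + j][left + t]) for j in range(J)] for t in range(T)]
```

Because no pixel is resampled, decoding returns exactly the coordinates that were encoded.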

5.2.2 Experimental details

The diffusion model and the spatial-temporal transformer are pretrained using HumanAct12 and NTU RGB+D following the split of cross-sub or cross-subject. The initial learning rate is set to $1\times 10^{-4}$ . Both models are trained with AdamW loshchilov2017decoupled with $\beta_{1}=0.9$ and $\beta_{2}=0.999$ for 500K iterations, and the batch size of each model is 1024. Some hyper-parameters are particularly important to diffusion models. We follow Nichol et al. nichol2021improved to design the diffusion network, which has 128 base channels and three residual blocks per resolution, and we adopt the geometric losses introduced by shi2020motionet ; petrovich2021action to train the proposed model. Furthermore, we set the number of diffusion steps to 1,000 and use a cosine noise schedule during training. To report each average with a 95% confidence interval, we generate the synthetic dataset 20 times with different random seeds. All experiments are conducted with PyTorch on an Ubuntu 18.04 platform with an Intel 9700K CPU and two NVIDIA Titan RTX GPUs.
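For reference, the cosine noise schedule of nichol2021improved with 1,000 diffusion steps can be sketched as below. The offset `s=0.008` and the clipping value `max_beta=0.999` are the defaults reported in that work; whether the same values are used here is an assumption, and the function names are illustrative.

```python
import math


def cosine_alpha_bar(t, T, s=0.008):
    """Unnormalized cumulative signal level f(t) = cos^2(((t/T + s) / (1 + s)) * pi/2)."""
    return math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2


def cosine_betas(T=1000, s=0.008, max_beta=0.999):
    """Per-step noise variances beta_t = 1 - f(t+1)/f(t), clipped at max_beta."""
    betas = []
    for t in range(T):
        a_prev = cosine_alpha_bar(t, T, s)
        a_next = cosine_alpha_bar(t + 1, T, s)
        betas.append(min(1.0 - a_next / a_prev, max_beta))
    return betas
```

The resulting variances stay small in the early steps and grow smoothly towards the end of the chain, where the clipping takes effect.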

For the action recognition models, we follow the experimental settings of their original authors. Please refer to the original papers for details.

Note that since we use a different representation of action sequences and re-train all the methods we want to compare, the experimental results of SOTA competitors in this paper may differ from the original publications.

5.3 Quantitative Results

5.3.1 Naturality and diversity evaluation

In Table 1, we summarise the naturality and diversity experimental results on the HumanAct12 dataset. The proposed method is compared with several SOTA action-conditioned motion generation methods, which are introduced as follows:

• CondGRU was originally an RNN-based audio-to-motion approach shlizerman2018audio , which guo2020action2motion modified so that the network receives the condition vector and pose vector.

• Two-stage GAN cai2018deep uses a motion generator to create a noise vector, which can be used to generate 2D motion sequences. The authors of Action2Motion guo2020action2motion extended the Two-stage GAN to work for 3D motions.

• Act-MoCoGAN tulyakov2018mocogan is a video generation method that synthesizes realistic video clips using noise vectors and certain content as inputs. Guo et al. guo2020action2motion updated it with different discriminators to make it suitable for motion generation tasks.

• Action2Motion guo2020action2motion is a gated recurrent unit (GRU) based VAE structure, which can generate natural motion sequences from action conditions at the frame level.

• ACTOR petrovich2021action is also a VAE-based approach, but relies on a transformer architecture to perform encoding and decoding operations.

The proposed method outperforms not only older GAN-based methods but also two recent VAE-based approaches on several naturality and diversity metrics. The improvements come from the novel diffusion structure, which (1) leverages its strong image synthesis capability to generate noise-free and diverse images, and (2) is guided by a spatial-temporal transformer that generates motion sequences stably and precisely. Therefore, the proposed method yields more natural motion sequences across different action categories.

Although the proposed method does not achieve the best result on some metrics, it shows higher robustness and lower fluctuation on these metrics than other SOTA approaches.
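The fluctuation discussed above presumably corresponds to the ± values in Table 1, which are 95% confidence intervals over the 20 randomly generated synthetic datasets. A minimal sketch of how such an interval could be computed is given below. It assumes a normal approximation with the factor 1.96 and the sample standard deviation; the exact procedure used for Table 1 is not stated in this section, and the function name is hypothetical.

```python
import math
import statistics


def mean_ci95(values):
    """Return (mean, half_width) of a normal-approximation 95% confidence interval."""
    n = len(values)
    mean = statistics.fmean(values)
    half_width = 1.96 * statistics.stdev(values) / math.sqrt(n)
    return mean, half_width
```

A narrower half-width over the repeated runs indicates that a metric is less sensitive to the random seed.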

5.3.2 Data augmentation evaluation

In this subsection, we discuss the experiment of data augmentation for action recognition. The experiment is divided into two parts: (1) Replacing real training data using different proportions of synthetic data; (2) Adding extra synthetic data into training data with different proportions.

Replacement experiment. We present the experimental results of the replacement experiment in Table 2. We conduct the experiment by replacing different proportions of the training set of three SOTA action recognition methods liu2020disentangling ; song2022constructing ; chen2021channel with synthetic data. The action recognition accuracy first increases as more real data is replaced, peaks at a 20%-30% replacement ratio, and decreases when the replacement ratio reaches 50%. The experimental results suggest that the synthetic data created by the proposed method is natural and diverse enough to improve recent SOTA action recognition models. Moreover, we also include the synthetic data from two motion generation competitors, and the 50% replacement results indicate that the proposed method generates more realistic and diverse action sequences, which are more useful in a downstream action recognition task.

Incremental experiment. We summarize the experimental results of the incremental experiment in Table 3. Similar to the replacement experiment, we apply a mixed training set of real and synthetic data on three SOTA action recognition methods. Rather than replacing real data with synthetic ones, we add extra synthetic data generated by the proposed method or the other two competitors. The experimental results suggest that action recognition performance increases significantly when we bring more synthetic action sequences into the training set and the performance peaks when 40% synthetic data is added. Furthermore, compared to the two competitors, the synthetic data generated by the proposed method can offer more performance gain to the SOTA action recognition models and make the performance more stable.
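The two protocols above can be sketched as follows. This is only an illustration under stated assumptions: the proportion is taken relative to the size of the real training set, samples are drawn uniformly without class stratification, and both function names are hypothetical.

```python
import random


def replace_with_synthetic(real, synthetic, ratio, seed=0):
    """Replacement protocol: swap a fraction `ratio` of the real samples for synthetic ones."""
    rng = random.Random(seed)
    k = round(len(real) * ratio)
    if k > len(synthetic):
        raise ValueError("not enough synthetic samples")
    kept = rng.sample(real, len(real) - k)
    return kept + rng.sample(synthetic, k)


def add_synthetic(real, synthetic, ratio, seed=0):
    """Incremental protocol: keep all real samples and add `ratio` * len(real) synthetic ones."""
    rng = random.Random(seed)
    k = round(len(real) * ratio)
    if k > len(synthetic):
        raise ValueError("not enough synthetic samples")
    return list(real) + rng.sample(synthetic, k)
```

With 100 real samples, for example, a 30% replacement keeps the training set at 100 samples, whereas a 40% increment grows it to 140.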

5.4 Qualitative Results

Figure 3 depicts examples of synthetic action sequences with two different labels, generated by the proposed method. Eight samples are selected from a single action sequence.

For the ‘point to something’ examples, the proposed method generates a natural and smooth action sequence that contains almost no noise. For the ‘kick something’ examples, the proposed method generates a realistic kicking action sequence with diverse and normal pose representations.

5.5 Ablation Studies

In this subsection, we further discuss the components of the proposed method and their contributions to the performance.

Table 4 summarizes the ablation studies on different guiding models and hyper-parameters. The second part of Table 4 compares the FID and action recognition accuracy obtained with different guiding models. EfficientNet tan2019efficientnet , a U-Net-based classifier dhariwal2021diffusion , and a fine-tuned CLIP radford2021learning are applied as guiding classifiers. The experimental results show that the proposed spatial-temporal transformer outperforms the other competitors owing to its high capability of handling action sequences that contain rich spatial-temporal information. In addition, we found that fewer than 1,000 diffusion steps (e.g., 250 and 500) are too few to obtain high-quality action sequences, while more than 1,000 steps (e.g., 2,500) bring no obvious performance gain. Finally, we compared the linear and cosine noise schedules and found that the cosine noise schedule is better for skeleton image generation because its performance on the different metrics is more stable.

Table 5 summarizes the ablation studies on the effectiveness of various guiding strategies. The experimental results suggest that the proposed method has both a smaller model size and a faster convergence speed. A looser dependency on clean training data enables an easier data preparation process and a more straightforward guided-diffusion model design.

6 Conclusions and Future Studies

In this paper, we introduced a novel spatial-temporal transformer-guided diffusion model for action recognition data augmentation tasks. The proposed method takes an action label as input, then generates high-quality action sequences with the corresponding target labels under the guidance of a spatial-temporal transformer. During the generation process, the proposed spatial-temporal transformer classifies the clean intermediate latents generated step by step by sampling from a Gaussian distribution. In the naturality and diversity evaluation and the data augmentation evaluation, the proposed method showed a superior capability of synthesizing high-quality action sequences compared to existing SOTA methods. In addition, the synthetic action sequences were tested with different SOTA action recognition approaches in two data augmentation tasks. The experimental results suggest that the proposed method can help boost action recognition performance with its realistic synthetic data. Since the proposed method is limited in generating long and consistent action sequences, in the future, the authors will investigate extending the proposed work to the long-term action sequence synthesis task and further improving the quality of synthetic action sequences.

The authors have no competing interests to declare that are relevant to the content of this article.

Bibliography

[1] Yu Kong and Yun Fu. Human action recognition and prediction: A survey. International Journal of Computer Vision, 130(5):1366–1401, 2022.
[2] Xin Liu, Silvia L Pintea, Fatemeh Karimi Nejadasl, Olaf Booij, and Jan C van Gemert. No frame left behind: Full video action recognition. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14892–14901, 2021.
[3] Mengdan Lou, Jieyu Li, Guoxing Wang, and Guanghui He. Ar-c 3d: Action recognition accelerator for human-computer interaction on fpga. In 2019 IEEE International Symposium on Circuits and Systems (ISCAS), pages 1–4. IEEE, 2019.
[4] Mohsen Fayyaz, Emad Bahrami, Ali Diba, Mehdi Noroozi, Ehsan Adeli, Luc Van Gool, and Jurgen Gall. 3d cnns with adaptive temporal feature resolutions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4731–4740, 2021.
[5] Haodong Duan, Yue Zhao, Yuanjun Xiong, Wentao Liu, and Dahua Lin. Omni-sourced webly-supervised learning for video recognition. In European Conference on Computer Vision, pages 670–688. Springer, 2020.
[6] Kunchang Li, Xianhang Li, Yali Wang, Jun Wang, and Yu Qiao. Ct-net: Channel tensorization network for video classification. arXiv preprint arXiv:2106.01603, 2021.
[7] Pichao Wang, Shuang Wang, Zhimin Gao, Yonghong Hou, and Wanqing Li. Structured images for rgb-d action recognition. In Proceedings of the IEEE international conference on computer vision workshops, pages 1005–1014, 2017.
[8] Pichao Wang, Wanqing Li, Zhimin Gao, Chang Tang, and Philip O Ogunbona. Depth pooling based large-scale 3-d action recognition with convolutional neural networks. IEEE Transactions on Multimedia, 20(5):1051–1061, 2018.