Unsupervised Video Anomaly Detection for Stereotypical Behaviours in Autism

Jiaqi Gao; Xinyang Jiang; Yuqing Yang; Dongsheng Li; Lili Qiu

arXiv:2302.13748·cs.CV·May 15, 2023

TL;DR

This paper introduces an unsupervised deep learning approach for detecting stereotypical behaviors in autism using video analysis, addressing the challenge of limited labeled data and unbounded behavior types.

Contribution

The paper presents DS-SBD, a novel dual-stream deep model that detects abnormal behaviors in videos without requiring labeled abnormal data, focusing on pose trajectories and action repetition patterns.

Findings

01

Effective detection of stereotypical behaviors in unlabeled videos

02

Outperforms state-of-the-art unsupervised VAD baselines (Frame-Pred., MNAD, HF2VAD) on SSBD

03

Proposes a new benchmark for autism behavior analysis

Tables

Table 1: Quantitative comparison between state-of-the-art models and our proposed model.

Method	AUROC (micro)	AUROC (macro)
Frame-Pred. [15]	52.52%	54.93%
MNAD [24]	53.70%	56.45%
HF2VAD [13]	60.43%	54.35%
DS-SBD-PR	54.54%	51.88%
DS-SBD-PP	62.01%	55.54%
DS-SBD-RD	69.87%	72.81%
DS-SBD	71.04%	73.39%

Table 2: Ablation study of different pose modalities.

PR (2D)	PR (3D)	PP (2D)	PP (3D)	RD	AUROC (micro)	AUROC (macro)
✓					54.54%	51.88%
	✓				57.85%	61.75%
		✓			62.01%	55.54%
			✓		60.34%	61.40%
				✓	69.87%	72.81%
✓		✓			61.99%	55.53%
✓		✓		✓	71.04%	73.39%
	✓		✓		60.30%	61.42%
	✓		✓	✓	70.65%	73.32%

Table 3: Ablation study of different numbers of input frames.

$T$ (frames)	AUROC (micro)	AUROC (macro)
4	69.37%	72.74%
8	70.07%	72.34%
16	70.12%	72.93%
64	71.04%	73.39%

Equations

$$\mathcal{L}^{\mathrm{PR}}=\|\mathcal{F}^{\mathrm{PR}}(\mathsf{tr}(\mathbf{P}))-\mathsf{tr}(\mathbf{P})\|_{2}^{2}=\sum_{i=1}^{N}\sum_{j=1}^{K}\|\mathcal{F}^{\mathrm{PR}}(\mathbf{P}_{j}^{i})-\mathbf{P}_{j}^{i}\|_{2}^{2}$$

$$s_{i}^{\mathrm{PR}}=\sum_{j=1}^{K}\|\mathcal{F}^{\mathrm{PR}}(\mathbf{P}_{j}^{i})-\mathbf{P}_{j}^{i}\|_{2}^{2}$$

$$\hat{\mathbf{P}}^{T+1}=\mathcal{F}^{\mathrm{PP}}(\mathsf{tr}(\mathbf{P}^{1:T}))$$

$$\mathcal{L}^{\mathrm{PP}}=\|\hat{\mathbf{P}}_{c}^{T+1}-\mathbf{P}_{c}^{T+1}\|_{2}^{2}+\|\hat{\mathbf{P}}^{T+1}-\mathbf{P}^{T+1}\|_{2}^{2}=\|\hat{\mathbf{P}}_{c}^{T+1}-\mathbf{P}_{c}^{T+1}\|_{2}^{2}+\sum_{j=1}^{K}\|\hat{\mathbf{P}}_{j}^{T+1}-\mathbf{P}_{j}^{T+1}\|_{2}^{2}$$

$$s_{i}^{\mathrm{PP}}=\|\hat{\mathbf{P}}_{c}^{i}-\mathbf{P}_{c}^{i}\|_{2}^{2}+\sum_{j=1}^{K}\|\hat{\mathbf{P}}_{j}^{i}-\mathbf{P}_{j}^{i}\|_{2}^{2}$$

$$\mathbf{M}_{i,j}=\mathrm{softmax}(-\|x_{i}-x_{j}\|_{2}^{2})$$

$$s_{i}^{\mathrm{RD}}=\mathcal{F}^{\mathrm{RD}}(\mathbf{X}^{i})$$

$$S_{i}=\alpha\cdot\frac{s_{i}^{\mathrm{PR}}-\mu_{\mathrm{PR}}}{\sigma_{\mathrm{PR}}}+\beta\cdot\frac{s_{i}^{\mathrm{PP}}-\mu_{\mathrm{PP}}}{\sigma_{\mathrm{PP}}}+\gamma\cdot s_{i}^{\mathrm{RD}}$$

Taxonomy

Topics: Autism Spectrum Disorder Research · Respiratory viral infections research · Genetics and Neurodevelopmental Disorders

Full text

Unsupervised Video Anomaly Detection for Stereotypical Behaviours in Autism

Abstract

Monitoring and analyzing stereotypical behaviours is important for early intervention and caretaking in Autism Spectrum Disorder (ASD). This paper focuses on automatically detecting stereotypical behaviours with computer vision techniques. Off-the-shelf methods tackle this task with supervised classification and activity recognition techniques. However, the unbounded types of stereotypical behaviours and the difficulty of collecting video recordings of ASD patients largely limit the feasibility of existing supervised detection methods. We therefore tackle these challenges from a new perspective, i.e. unsupervised video anomaly detection for stereotypical behaviour detection. The models can be trained on unlabeled videos containing only normal behaviours, and unknown types of abnormal behaviours can be detected during inference. Correspondingly, we propose a Dual Stream deep model for Stereotypical Behaviours Detection, DS-SBD, based on the temporal trajectory of human poses and the repetition patterns of human actions. Extensive experiments verify the effectiveness of the proposed method and suggest that it can serve as a potential benchmark for future research.

**Index Terms:** video anomaly detection, autism spectrum disorder, stereotypical behaviours

1 Introduction

Autism spectrum disorder (ASD) is a neurological and developmental disorder [1] that begins in early childhood and can last throughout a person’s life. It causes problems with functioning in society and often affects how people interact, communicate, and socialize with others, resulting in stereotypical behaviours [2]. Stereotypical behaviours are abnormal, non-functional repetitive behaviours that happen with no obvious stimulus, such as arm flapping, head banging, and spinning. They negatively affect ASD children’s skill acquisition and social interaction and, as stress indicators, can even lead to a meltdown event or self-damaging behaviours [3]. As a result, monitoring, evaluating, and analyzing stereotypical behaviours is essential for the clinicians and caregivers who treat and take care of ASD patients, and an automated stereotypical behaviour detection system holds great potential for ASD patient care and treatment.

In this paper, we focus on automatically detecting stereotypical behaviours from video recordings of ASD patients. De Belen et al. [4] surveyed recent vision-based methods, which mainly focus on correctly classifying the stereotypical behaviours in autism with the help of action recognition [5, 6, 7] and video classification [8, 9, 10, 11, 12] techniques. Existing methods perform well on a limited set of pre-defined stereotypical behaviour types through supervised learning paradigms. However, in practice, stereotypical behaviour detection is an open-set problem, where the types of ASD stereotypical behaviours are unbounded and vary widely across patients. Thus, there will always be novel behaviour types unseen in the training set, which previous methods are not able to detect. Furthermore, collecting clinical video datasets is very challenging because of privacy concerns and the high cost of annotation by medical professionals.

To address the challenges of unknown behaviour types and difficult data collection, we propose to study ASD stereotypical behaviour detection from a new perspective, i.e. unsupervised video anomaly detection (VAD). Unsupervised VAD learns the distribution of normal behaviours during training and identifies anomalous ASD behaviours as outliers of the learned distribution. Since unsupervised VAD can detect any type of anomaly that falls outside the distribution of normal behaviours, it is not limited to a finite set of pre-defined anomaly types. Furthermore, unsupervised VAD does not require collecting any data containing abnormal behaviours for training. Hence, it eases the burden of collecting clinical videos of ASD patients.

However, existing unsupervised VAD approaches [13, 14, 15, 16, 17] mainly focus on surveillance scenarios, and directly migrating them to stereotypical behaviour detection is non-trivial, for two reasons:

Stereotypical behaviours of ASD patients exhibit specific repetitive patterns, while existing unsupervised VAD methods cannot incorporate such prior knowledge.
The videos of ASD patients are recorded in unconstrained environments with various viewpoints and background noise, which challenges conventional unsupervised VAD methods designed for surveillance videos recorded in constrained environments.

As a result, we propose a novel Dual Stream network for Stereotypical Behaviours Detection, DS-SBD, where each stream tackles one of the two aforementioned challenges. First, to improve robustness to domain variance and background noise, we propose a pose trajectory module that models the temporal consistency of human actions based on the temporal trajectory of human pose keypoints, filtering out the background noise and domain variance of the raw image frames. Second, to incorporate the repetition pattern of ASD stereotypical behaviours, we propose a repetition detection module which detects abnormal behaviours based on frame-level repetitive patterns. The proposed DS-SBD is trained in an unsupervised fashion on videos containing only normal human behaviours with three proxy tasks, i.e. a pose reconstruction task, a pose prediction task, and a repetition detection task.

Our main contributions are summarized as follows:

To tackle unknown behaviour types and the difficulty of data collection, we formulate ASD stereotypical behaviour detection as an unsupervised video anomaly detection task and reorganize the existing self-stimulatory behaviour dataset (SSBD) for evaluation.
To leverage prior knowledge of ASD stereotypical behaviours and improve robustness, we propose a dual stream anomaly detection network, DS-SBD, which ensembles novel pose trajectory and repetition detection modules.
Extensive experimental results and ablation studies verify the effectiveness of each module, suggesting DS-SBD could serve as a benchmark for this new challenging task in the future.

2 Methodology

Fig. 1 shows the overall network structure of our proposed DS-SBD. It is a dual stream structure containing two modules, namely a pose trajectory module and a repetition detection module. The pose trajectory module detects stereotypical behaviours based on human pose trajectories. The repetition detection module detects abnormal behaviours based on the action repetitions over a certain period. Following the unsupervised video anomaly detection training setting, the training set only needs to contain videos with normal behaviours. The model is expected to learn the distribution of normal behaviours from the training videos and to output a frame-level anomaly score at inference time, which is used to judge whether a frame shows an out-of-distribution abnormal behaviour.

2.1 Preliminaries

Given a video with $N$ frames $\mathbf{X}=[\mathbf{X}^{1},\mathbf{X}^{2},\ldots,\mathbf{X}^{N}]\in\mathbb{R}^{N\times C\times H\times W}$, the human pose with $K$ keypoints in the $i$-th frame is denoted as $\mathbf{P}^{i}=[\mathbf{P}^{i}_{1},\mathbf{P}^{i}_{2},\ldots,\mathbf{P}^{i}_{K}]\in\mathbb{R}^{1\times K\times d}$, where $d$ is the number of coordinate dimensions of the human pose. The trajectory of the $j$-th keypoint of one human pose is defined as $\mathsf{tr}(\mathbf{P}_{j})=[\mathbf{P}^{1}_{j},\mathbf{P}^{2}_{j},\ldots,\mathbf{P}^{N}_{j}]\in\mathbb{R}^{N\times 1\times d}$.

2.2 Pose Trajectory Module

The pose trajectory module is trained with two proxy tasks, i.e. the pose reconstruction task and the pose prediction task.

2.2.1 Pose Reconstruction

The pose reconstruction (PR) proxy task assumes that the pose trajectories of normal behaviours can be well reconstructed by an autoencoder while those of anomalous behaviours cannot. Specifically, an LSTM-based autoencoder $\mathcal{F}^{\mathrm{PR}}$ is proposed for the reconstruction proxy task. $\mathcal{F}^{\mathrm{PR}}$ takes a human pose trajectory $\mathsf{tr}(\mathbf{P})$ as input and aims to reconstruct every human pose keypoint in this trajectory during training. An MSE training loss $\mathcal{L}^{\mathrm{PR}}$ is used to optimize $\mathcal{F}^{\mathrm{PR}}$:

$$\mathcal{L}^{\mathrm{PR}}=\|\mathcal{F}^{\mathrm{PR}}(\mathsf{tr}(\mathbf{P}))-\mathsf{tr}(\mathbf{P})\|_{2}^{2}=\sum_{i=1}^{N}\sum_{j=1}^{K}\|\mathcal{F}^{\mathrm{PR}}(\mathbf{P}_{j}^{i})-\mathbf{P}_{j}^{i}\|_{2}^{2}$$

During inference, the pose reconstruction errors of $\mathcal{F}^{\mathrm{PR}}$ on the keypoints of a frame are summed to obtain a frame-level anomaly score:

$$s_{i}^{\mathrm{PR}}=\sum_{j=1}^{K}\|\mathcal{F}^{\mathrm{PR}}(\mathbf{P}_{j}^{i})-\mathbf{P}_{j}^{i}\|_{2}^{2}$$
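As a minimal sketch of the frame-level reconstruction score, the snippet below sums the squared L2 error over the $K$ keypoints of each frame. It assumes the autoencoder output is already available as a nested list; the function name `frame_reconstruction_scores` and the toy values are illustrative and not taken from the paper.

```python
def frame_reconstruction_scores(poses, reconstructed):
    """Return one anomaly score per frame: the squared L2 reconstruction
    error summed over all K keypoints of that frame.

    Both arguments are nested lists shaped [N frames][K keypoints][d coords].
    """
    scores = []
    for frame, frame_hat in zip(poses, reconstructed):
        err = 0.0
        for kp, kp_hat in zip(frame, frame_hat):
            # Squared Euclidean distance between a keypoint and its reconstruction.
            err += sum((a - b) ** 2 for a, b in zip(kp_hat, kp))
        scores.append(err)
    return scores


# Toy example (hypothetical values): N=2 frames, K=2 keypoints, d=2 coordinates.
poses = [[[0.0, 0.0], [1.0, 1.0]],
         [[0.0, 0.0], [1.0, 1.0]]]
recon = [[[0.0, 0.0], [1.0, 1.0]],   # perfect reconstruction
         [[0.0, 1.0], [3.0, 1.0]]]   # keypoint errors of 1 and 4
print(frame_reconstruction_scores(poses, recon))  # [0.0, 5.0]
```

A frame whose pose is poorly reconstructed receives a larger score, which is the behaviour the proxy task relies on.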

2.2.2 Pose Prediction

The pose prediction (PP) proxy task assumes that normal human behaviours are temporally consistent, while abnormal ones usually come with unexpected changes of action. Specifically, given a trajectory of $T$ consecutive poses $\mathsf{tr}(\mathbf{P}^{1:T})$, the pose prediction task attempts to forecast the next human pose $\mathbf{P}^{T+1}$ with a deep model $\mathcal{F}^{\mathrm{PP}}$:

$$\hat{\mathbf{P}}^{T+1}=\mathcal{F}^{\mathrm{PP}}(\mathsf{tr}(\mathbf{P}^{1:T}))$$

Following [18], the pose prediction task is built upon local pose trajectory (all keypoints) and global pose trajectory (center point of all keypoints) forecasting. For the local keypoints, $\mathcal{F}^{\mathrm{PP}}$ is an LSTM-based variational autoencoder. For the center point, a cascaded LSTM is used for prediction. Similar to the pose reconstruction task, we use MSE to optimize $\mathcal{F}^{\mathrm{PP}}$:

$$\mathcal{L}^{\mathrm{PP}}=\|\hat{\mathbf{P}}_{c}^{T+1}-\mathbf{P}_{c}^{T+1}\|_{2}^{2}+\|\hat{\mathbf{P}}^{T+1}-\mathbf{P}^{T+1}\|_{2}^{2}=\|\hat{\mathbf{P}}_{c}^{T+1}-\mathbf{P}_{c}^{T+1}\|_{2}^{2}+\sum_{j=1}^{K}\|\hat{\mathbf{P}}_{j}^{T+1}-\mathbf{P}_{j}^{T+1}\|_{2}^{2}$$

where $\hat{\mathbf{P}}^{T+1}$ and $\hat{\mathbf{P}}_{c}^{T+1}$ are the predicted local pose keypoints and the predicted global pose keypoint of frame $T+1$, respectively.

Similar to the pose reconstruction task, the anomaly score of a frame is its forecasting error given the past trajectory:

$$s_{i}^{\mathrm{PP}}=\|\hat{\mathbf{P}}_{c}^{i}-\mathbf{P}_{c}^{i}\|_{2}^{2}+\sum_{j=1}^{K}\|\hat{\mathbf{P}}_{j}^{i}-\mathbf{P}_{j}^{i}\|_{2}^{2}$$
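The prediction score can be sketched in the same way: a global term for the center point plus a local term summed over the keypoints. The text describes the center point only as the "center point of all keypoints", so taking it as the mean of the keypoints is an assumption here. The helper names `sq_dist`, `center_point`, and `prediction_score` are hypothetical.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two coordinate lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def center_point(pose):
    """Mean of the keypoints (an assumed definition of the global point)."""
    k, d = len(pose), len(pose[0])
    return [sum(kp[c] for kp in pose) / k for c in range(d)]


def prediction_score(pred_pose, pred_center, true_pose):
    """Global (center point) error plus local (per-keypoint) error for one frame."""
    local = sum(sq_dist(p_hat, p) for p_hat, p in zip(pred_pose, true_pose))
    global_err = sq_dist(pred_center, center_point(true_pose))
    return global_err + local


# Toy frame (hypothetical values): K=2 keypoints, d=2 coordinates.
true_pose = [[0.0, 0.0], [2.0, 2.0]]   # center is (1, 1)
pred_pose = [[0.0, 1.0], [2.0, 2.0]]   # local error of 1
pred_center = [1.0, 2.0]               # global error of 1
print(prediction_score(pred_pose, pred_center, true_pose))  # 2.0
```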

2.3 Repetition Detection Module

We observe that one of the most distinctive characteristics of stereotypical behaviours in autism spectrum disorder is their repetitive pattern. In other words, the anomalous behaviours are repeated periodically over short time intervals in the videos. To leverage this essential prior knowledge, we propose a repetition detection (RD) module. Inspired by recent repetition counting methods [19, 20, 21], we model the repetitive patterns as a temporal self-similarity matrix $\mathbf{M}$, whose element $\mathbf{M}_{i,j}$ is the similarity score between the feature embeddings of the $i$-th and $j$-th frames, here the negative squared Euclidean distance, followed by a row-wise softmax operation [20]:

$$\mathbf{M}_{i,j}=\mathrm{softmax}(-\|x_{i}-x_{j}\|_{2}^{2})$$

where $x_{i}$ and $x_{j}$ are the latent feature embeddings of the $i$-th and $j$-th frames.
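To make the self-similarity matrix concrete, the following sketch builds it from per-frame embeddings as the row-wise softmax of negative squared Euclidean distances. The embeddings are toy values standing in for the learned features, and the function name `self_similarity_matrix` is illustrative. Any temperature scaling used by the backbone is not modelled.

```python
import math


def self_similarity_matrix(embeddings):
    """Row-wise softmax over negative squared distances between frame embeddings."""
    n = len(embeddings)
    matrix = []
    for i in range(n):
        logits = [
            -sum((a - b) ** 2 for a, b in zip(embeddings[i], embeddings[j]))
            for j in range(n)
        ]
        top = max(logits)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - top) for v in logits]
        total = sum(exps)
        matrix.append([e / total for e in exps])
    return matrix


# A toy period-2 sequence: frames 0 and 2 share an embedding, as do 1 and 3.
emb = [[0.0, 0.0], [3.0, 0.0], [0.0, 0.0], [3.0, 0.0]]
M = self_similarity_matrix(emb)
for row in M:
    print([round(v, 4) for v in row])
```

For a periodic action, each row puts most of its mass on frames one or more periods apart, which produces the striped pattern that repetition counting models look for.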

Based on the self-similarity matrix, the repetition detection module $\mathcal{F}^{\mathrm{RD}}$ outputs an anomaly score for each video frame, indicating the probability that the frame contains repetitive actions:

$$s_{i}^{\mathrm{RD}}=\mathcal{F}^{\mathrm{RD}}(\mathbf{X}^{i})$$

where $s^{\mathrm{RD}}_{i}$ is the anomaly score of the $i$-th frame from the repetition detection module, and $\mathbf{X}^{i}$ is the $i$-th input frame. The proposed repetition detection module can be trained on a public repetition counting dataset or on videos synthesized from the VAD training set.

2.4 Anomaly Score

The final anomaly score $S_{i}$ of each video frame is the weighted sum of the two anomaly scores from the pose trajectory module and the anomaly score from the repetition detection module:

$$S_{i}=\alpha\cdot\frac{s_{i}^{\mathrm{PR}}-\mu_{\mathrm{PR}}}{\sigma_{\mathrm{PR}}}+\beta\cdot\frac{s_{i}^{\mathrm{PP}}-\mu_{\mathrm{PP}}}{\sigma_{\mathrm{PP}}}+\gamma\cdot s_{i}^{\mathrm{RD}}$$

where $\alpha$, $\beta$, and $\gamma$ are the weights of the three anomaly scores, and $\mu_{\mathrm{PR}}$, $\sigma_{\mathrm{PR}}$, $\mu_{\mathrm{PP}}$, and $\sigma_{\mathrm{PP}}$ are the means and standard deviations of the pose reconstruction and pose prediction errors on the training set.
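The fusion step can be sketched as follows: the two pose scores are standardized with statistics estimated on the training errors, and the repetition score is added unnormalized. The text does not say whether the population or the sample standard deviation is used, so the choice of `statistics.pstdev` is an assumption. The function name `fuse_scores` and all numbers are illustrative.

```python
from statistics import mean, pstdev


def fuse_scores(s_pr, s_pp, s_rd, train_pr, train_pp,
                alpha=1.0, beta=1.0, gamma=1.0):
    """Weighted sum of z-scored PR/PP errors and the raw RD score, per frame."""
    mu_pr, sigma_pr = mean(train_pr), pstdev(train_pr)  # assumed: population std
    mu_pp, sigma_pp = mean(train_pp), pstdev(train_pp)
    return [
        alpha * (pr - mu_pr) / sigma_pr
        + beta * (pp - mu_pp) / sigma_pp
        + gamma * rd
        for pr, pp, rd in zip(s_pr, s_pp, s_rd)
    ]


# Toy training errors (hypothetical): mean 2 / std 1 for PR, mean 4 / std 2 for PP.
train_pr = [1.0, 3.0]
train_pp = [2.0, 6.0]
fused = fuse_scores([2.0, 5.0], [4.0, 8.0], [0.1, 0.9], train_pr, train_pp)
print(fused)  # approximately [0.1, 5.9]
```

Standardizing the two pose errors puts them on a comparable scale before weighting. The sketch fails with a division by zero if the training errors have zero variance, so a real pipeline would need to guard against that.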

3 Experiments

3.1 Dataset

In our experiments, we use the self-stimulatory behaviour dataset (SSBD) [2], a publicly available benchmark dataset for stereotypical behaviour detection, to evaluate the models. The SSBD dataset contains 75 videos of three stereotypical behaviours, i.e. arm flapping, head banging, and spinning. Following the unsupervised VAD setting, we split the dataset into a test set of 20 videos and a training set of the remaining videos. All sub-clips containing stereotypical behaviours are excluded from the training videos.

3.2 Implementation Details

We use the Adam optimizer for training with a learning rate of 0.004. AlphaPose [22] is used to generate the 2D human poses and VideoPose3D [23] is used to generate the 3D human poses. The batch size is set to 60 and the number of consecutive frames per sample, $T$, is set to 64. The repetition detection module uses the backbone of RepNet [20]. During testing, we apply non-overlapping sliding windows of $T$ frames to compute the final frame-level anomaly scores.
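The non-overlapping windowing used at test time can be illustrated with a short sketch. How a final window shorter than $T$ frames is handled is not stated in the text. Keeping it as a shorter window is an assumption of this example, and the function name `non_overlapping_windows` is hypothetical.

```python
def non_overlapping_windows(num_frames, window=64):
    """Return (start, end) frame index pairs, end exclusive, covering the video."""
    return [
        (start, min(start + window, num_frames))  # assumed: keep the short tail
        for start in range(0, num_frames, window)
    ]


print(non_overlapping_windows(150))  # [(0, 64), (64, 128), (128, 150)]
```

Because the windows do not overlap, every frame receives exactly one score from each module.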

3.3 Results

We use two evaluation metrics that are widely used in the video anomaly detection community, i.e. the micro-averaged area under the receiver operating characteristic curve (AUROC) and the macro-averaged AUROC, to evaluate the models. Specifically, the micro-averaged AUROC is the overall frame-level AUC computed after concatenating the frames of all test videos. The macro-averaged AUROC is the average of the frame-level AUCs computed separately for each video.
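The difference between the two averages can be seen in a small sketch. AUROC is computed here as the fraction of (abnormal, normal) frame pairs that are ranked correctly, with ties counted as half, which equals the area under the ROC curve. The quadratic loop is only suitable for toy inputs. The function names and the data are illustrative, and the sketch assumes every test video contains both normal and abnormal frames.

```python
def auroc(scores, labels):
    """AUROC as the probability that an abnormal frame outscores a normal one."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs both normal and abnormal frames")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def micro_macro_auroc(videos):
    """videos: list of (scores, labels) pairs, one pair per test video."""
    all_scores = [s for scores, _ in videos for s in scores]
    all_labels = [y for _, labels in videos for y in labels]
    micro = auroc(all_scores, all_labels)                     # pooled frames
    macro = sum(auroc(s, y) for s, y in videos) / len(videos)  # per-video mean
    return micro, macro


# Two toy videos whose scores are ranked perfectly within each video
# but live on different scales.
videos = [([0.1, 0.2, 0.9], [0, 0, 1]),
          ([0.05, 0.06, 0.07], [0, 1, 1])]
micro, macro = micro_macro_auroc(videos)
print(round(micro, 4), macro)  # 0.5556 1.0
```

The macro average ignores score offsets between videos, whereas the micro average penalizes them, so the two numbers can diverge as in this example.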

We report the results of our DS-SBD with different proxy tasks and compare them with several state-of-the-art unsupervised VAD methods, including Frame-Pred. [15], HF2VAD [13], and MNAD [24], in Table 1. Combining the three proxy tasks boosts the micro-AUROC from 54.54% (pose reconstruction alone) to 71.04%, and the best macro-AUROC reaches 73.39%, which significantly outperforms the baseline models. In addition, we observe that the repetition detection module plays a dominant role in unsupervised video anomaly detection for autism spectrum disorder because stereotypical behaviours are often characterized by repetition. The visualization results are shown in Fig. 2.

3.4 Ablation Study

We conduct ablation studies to investigate the factors that may contribute to the anomaly detection performance.

2D pose vs. 3D pose. Although a 3D skeleton trajectory can provide depth information about human motion, 3D pose estimation is usually less stable and robust than 2D pose estimation because inferring 3D information from 2D frames is more challenging. As shown in Table 2, our full model achieves slightly better performance when taking 2D poses as input (71.04% vs. 70.65% micro-AUROC).

Number of frames. Considering the temporal consistency and periodicity of stereotypical behaviours, we also investigate whether the number of input frames affects the performance. As shown in Table 3, the model achieves the best performance when the input is a relatively long sequence of frames (e.g. $T=64$). This is because low-frequency stereotypical behaviours often require more information from past frames to accurately discover a periodic repetition pattern.

Weight estimation. We estimate $\alpha$, $\beta$, and $\gamma$ by grid search over the range 0 to 3. In Table 4, DS-SBD* achieves the best performance when $\alpha=1.5$, $\beta=0.2$, and $\gamma=1.3$, a marginal improvement over the default setting ($\alpha=\beta=\gamma=1$), which shows that our model is relatively robust to the choice of weights.

4 Conclusion

In this paper, we introduce a new research perspective on stereotypical behaviour detection in autism spectrum disorder, i.e. unsupervised video anomaly detection. To better leverage prior knowledge of ASD and improve robustness, we propose a dual stream deep model, DS-SBD, that detects abnormal behaviours based on the temporal trajectory of human poses and the repetition patterns of human actions. Extensive experimental results demonstrate the effectiveness of our method, which may act as a benchmark for future research. In the future, we will investigate simpler but effective proxy tasks to boost the model's discriminability.

Bibliography

[1] Isabelle Rapin, “Autistic children: Diagnosis and clinical features,” Pediatrics, 1991.
[2] Shyam Rajagopalan, Abhinav Dhall, and Roland Goecke, “Self-stimulatory behaviours in the wild for autism diagnosis,” in ICCVW, 2013.
[3] Nastaran Mohammadian Rad, Seyed Mostafa Kia, Calogero Zarbo, Twan van Laarhoven, Giuseppe Jurman, Paola Venuti, Elena Marchiori, and Cesare Furlanello, “Deep learning for automatic stereotypical motor movement detection using wearable sensors in autism spectrum disorders,” Signal Processing, 2018.
[4] Ryan Anthony J de Belen, Tomasz Bednarz, Arcot Sowmya, and Dennis Del Favero, “Computer vision in autism spectrum disorder research: a systematic review of published studies from 2009 to 2019,” Translational Psychiatry, 2020.
[5] Deepak Pandian, Shyam Sundar Rajagopalan, Dinesh Babu Jayagopi, et al., “Detecting a child’s stimming behaviours for autism spectrum disorder diagnosis using rgbpose-slowfast network,” in ICIP, 2022.
[6] Prashant Pandey, AP Prathosh, Manu Kohli, and Josh Pritchard, “Guided weak supervision for action recognition with scarce data to assess skills of children with autism,” in AAAI, 2020.
[7] Pengbo Wei, David Ahmedt-Aristizabal, Harshala Gammulle, Simon Denman, and Mohammad Ali Armin, “Vision-based activity recognition in children with autism-related behaviors,” arXiv preprint arXiv:2208.04206, 2022.
[8] Anish Lakkapragada, Aaron Kline, Onur Cezmi Mutlu, Kelley Paskov, Brianna Chrisman, Nathaniel Stockham, Peter Washington, Dennis Paul Wall, et al., “The classification of abnormal hand movement to aid in autism detection: Machine learning study,” JMIR Biomedical Engineering, 2022.