Spatio-Temporal Adversarial Learning for Detecting Unseen Falls

Shehroz S. Khan; Jacob Nogas; Alex Mihailidis

arXiv:1905.07817 · cs.LG · July 24, 2020

TL;DR

This paper introduces a spatio-temporal adversarial learning framework that detects unseen falls by modeling normal activities and identifying anomalies, addressing the challenge of limited fall data in fall detection systems.

Contribution

It proposes a novel adversarial learning approach using spatio-temporal autoencoders and convolutional networks to detect unseen falls without training on fall data.

Findings

01 Outperforms baseline methods on three datasets

02 Effective in privacy-preserving camera modalities

03 Detects unseen falls accurately

Tables

Table 1: Configuration of the 3D Generator. The values inside the parenthesis for fully connected layers are the number of neurons.

Input	(8, 64, 64, 1)
Encoder	3D Convolution - (8, 64, 64, 16)
	3D Convolution - (8, 32, 32, 8)
	3D Convolution - (4, 16, 16, 8)
	3D Convolution - (2, 8, 8, 8)
Decoder	3D Deconvolution - (4, 16, 16, 8)
	3D Deconvolution - (8, 32, 32, 8)
	3D Deconvolution - (8, 64, 64, 16)
	3D Convolution - (8, 64, 64, 1)

Table 2: Configuration of the DAE-AN. The values inside the parenthesis for fully connected layers are the number of neurons.

Input	(64, 64, 1)
Encoder	Fully Connected - (4096)
	Fully Connected - (1500)
	Fully Connected - (1000)
	Fully Connected - (500)
Decoder	Fully Connected - (1000)
	Fully Connected - (1500)
	Fully Connected - (4096)
	Fully Connected - (64, 64, 1)

Table 3: Configuration of the CAE-AN. The values inside the parenthesis are the size of the convolution filters.

Input	(64, 64, 1)
Encoder	2D Convolution - (64, 64, 16)
	2D Convolution - (32, 32, 16)
	2D Convolution - (16, 16, 8)
	2D Convolution - (8, 8, 8)
Decoder	2D Deconvolution - (16, 16, 8)
	2D Deconvolution - (32, 32, 8)
	2D Deconvolution - (64, 64, 16)
	2D Deconvolution - (64, 64, 1)

Table 4: AUC values for different adversarial networks for each dataset (using frame-based anomaly scoring).

Models	Thermal	UR	UR-Filled	SDU	SDU-Filled
DAE-AN	0.62	0.46	0.65	0.68	0.91
CAE-AN	0.62	0.36	0.78	0.62	0.89
$C_{\mu}$	0.95	0.47	0.88	0.69	0.90
$C_{\sigma}$	0.95	0.74	0.91	0.69	0.91

Equations

$$v_{ij}^{xyz} = f\left(\sum_{m}\sum_{p=0}^{P_{i}-1}\sum_{q=0}^{Q_{i}-1}\sum_{s=0}^{S_{i}-1} w_{ijm}^{pqs}\, v_{(i-1)m}^{(x+p)(y+q)(z+s)} + b_{ij}\right)$$

$$\boldsymbol{O} = \boldsymbol{I} \sim p$$

$$\min_{\mathcal{R}} \max_{\mathcal{D}} \Big( \mathbb{E}_{\boldsymbol{I} \sim p}[\log(\mathcal{D}(\boldsymbol{I}))] + \mathbb{E}_{\boldsymbol{O} \sim p}[\log(1 - \mathcal{D}(\mathcal{R}(\boldsymbol{O})))] \Big)$$

$$\mathcal{L_{R}} = \| I_{i,j} - O_{i,j} \|_{2}^{2}$$

$$\mathcal{L} = \mathcal{L_{R}} + \lambda \mathcal{L_{R+D}}$$

$$R_{i,j} = \| I_{i,j} - O_{i,j} \|_{2}^{2}$$

$C_{\mu}^{j}$, $C_{\sigma}^{j}$

$$W_{\mu}^{i} = \frac{1}{T}\sum_{j=i}^{T+i-1} R_{i,j}, \qquad W_{\sigma}^{i} = \sqrt{\frac{1}{T}\sum_{j=i}^{T+i-1} \left(R_{i,j} - W_{\mu}^{i}\right)^{2}}$$

Full text

Shehroz S. Khan: KITE – Toronto Rehabilitation Institute, University Health Network, Canada. Email: [email protected]

Jacob Nogas, Alex Mihailidis: University of Toronto, Canada. Email: [email protected], [email protected]

Abstract

Fall detection is an important problem from both the health and machine learning perspectives. A fall can lead to severe injuries, long-term impairments or even death in some cases. In terms of machine learning, it presents a severe class imbalance problem, with very little or no training data for falls, because falls occur rarely. In this paper, we take an alternate philosophy to detect falls in the absence of their training data, by training the classifier on only the normal activities (that are available in abundance) and identifying a fall as an anomaly. To realize such a classifier, we use an adversarial learning framework, which comprises a spatio-temporal autoencoder for reconstructing input video frames and a spatio-temporal convolution network to discriminate them against original video frames. 3D convolutions are used to learn spatial and temporal features from the input video frames. The adversarial learning of the spatio-temporal autoencoder will enable reconstructing the normal activities of daily living efficiently, thus rendering detecting unseen falls plausible within this framework. We tested the performance of the proposed framework on camera sensing modalities that may preserve an individual’s privacy (fully or partially), such as thermal and depth cameras. Our results on three publicly available datasets show that the proposed spatio-temporal adversarial framework performed better than other baseline frame-based (or spatial) adversarial learning methods.

Keywords:

Fall, Spatio-Temporal, Adversarial Learning, Autoencoder, Thermal Camera, Depth Camera

1 Introduction

Falls can cause severe injuries to people, resulting in permanent or partial disability, huge health care costs and the development of negative social and psychological problems smartrisk. This constitutes a strong motivation to detect falls. However, a fall occurs rarely in comparison to normal activities of daily living (ADL) khan2017detecting. Because falls are rare, traditional supervised machine learning classifiers are difficult to use for this task khan2017review. In many cases, there may be very little or no fall data available during training because collecting fall data is very challenging and can put people’s lives in danger khan2017review. On the other hand, normal ADL performed by people are abundantly available and easier to collect. Therefore, we propose to detect falls in a one-class classification (OCC) framework khan2014one that enables a classifier to learn only from the normal ADL and to detect an unseen fall during testing (as falls may not be present during training).

Learning one-class classifiers from video sequences of normal ADL to detect falls as anomalies is a challenging task nogasfall2018. Previous research suggests that autoencoders can effectively learn ‘normal’ ADL from wearable and computer vision data and can detect abnormal variations, such as falls, based on the reconstruction error khan2017detecting nogasfall2018. For detecting falls from videos, spatio-temporal autoencoders have been shown to perform well in comparison to 2-D convolutional autoencoders and general deep autoencoders nogas2019deepfall. Another challenge in video-based fall detection is to preserve the privacy of the person, which traditional RGB cameras cannot provide. Thus, detecting falls in videos without explicitly knowing a person’s identity is important for the real-world usability of such systems.

Convolutional neural networks (CNN), recurrent neural networks and spatio-temporal convolutional neural networks are commonly used to detect human activities zhao2017pooling xu2018sequential ji20133d and anomalies sabokrou2018deep zhou2019anomalynet zhou2016spatial in videos. The adversarial learning framework using different neural network models has also been used effectively to solve anomaly detection problems in images akcay2018ganomaly schlegl2019f schlegl2017unsupervised and videos vu2019robust ravanbakhsh2017abnormal. The learning paradigm using generative adversarial networks (GAN) presents a unique opportunity to not only mimic normal behaviour through the generator but also to effectively discriminate it from anomalies schlegl2017unsupervised. In the context of the fall detection problem, adversarial learning will help in mimicking the normal ADL with high accuracy, which could result in detecting unseen falls with a higher degree of confidence. Therefore, in this paper, we extend the idea of training a spatio-temporal autoencoder in an adversarial manner to validate its role in detecting unseen falls from (privacy protecting) videos. The proposed framework is different from the original formulation of GAN for anomaly detection, where images are generated from Gaussian noise schlegl2017unsupervised.

In this paper, we design a new spatio-temporal adversarial learning framework, which consists of a spatio-temporal convolutional autoencoder (3DCAE) to reconstruct a sequence of normal ADL video frames and a spatio-temporal convolutional neural network (3DCNN) as a classifier to discriminate them from the original sequence of video frames. The spatio-temporal architecture of the adversarial framework consists of 3D convolutional layers that extract both spatial and temporal features from the video frames, resulting in a robust system to learn normal ADL from the video sequences. After the training is completed, the 3DCAE would be able to reconstruct ADL sequences efficiently and the 3DCNN would be able to differentiate between real and reconstructed ADL sequences. During testing, when a video sequence containing fall frames is shown to this network, high reconstruction error and/or low probability of the discriminator will indicate an anomalous video sequence. Therefore, this framework would be able to identify unseen falls with high accuracy. The reconstruction error of the 3DCAE or the probability output of the 3DCNN or their combination can be used as an anomaly score to identify unseen falls during testing. We use two computer vision sensing modalities, thermal and depth cameras, to test the proposed framework. Both these sensing modalities can partially or fully obfuscate the facial identity of the person; thus, they are more promising for use in a home setting. We also implemented two spatial (or frame-based) variations of adversarial learning baselines with (i) a deep autoencoder to reconstruct input frames and a deep neural network as a discriminator, and (ii) a convolutional autoencoder to reconstruct input frames and a CNN as a discriminator (similar to the work of Sabokrou2018Adversarially). The input to both of these methods is a frame from the video, whereas the input to our proposed method is a sequence of video frames. Our results on three publicly available fall detection datasets captured using thermal and depth cameras show superior performance of the spatio-temporal adversarial learning framework in detecting unseen falls in comparison to these spatial adversarial approaches.

The paper is organized as follows. In Section 2, we present a literature review on using adversarial techniques for anomaly detection in images and videos. In Section 3, we introduce the proposed spatio-temporal adversarial learning framework. Section 4 presents various anomaly scores to detect unseen falls. The experiments and results are described in Section 5, followed by conclusions and pointers to future research in Section 6.

2 Related Work

In this paper, we detect falls in an OCC framework. To the best of our knowledge, fall detection has not been addressed using an adversarial learning framework; therefore, we present related literature review on techniques that use adversarial learning of autoencoders (or their variants) for general anomaly detection in images and videos.

2.1 Adversarial Anomaly Detection in Images

One of the earliest works to detect anomalies using an adversarial framework is presented by Schlegl et al. schlegl2017unsupervised to find anomalies in imaging data as candidates for markers, called AnoGAN. The generator of their GAN is equivalent to a multi-layered convolutional decoder that samples input from uniformly distributed noise. The discriminator is a standard CNN that maps 2D images to a single value that can be interpreted as the probability that its input is a real image rather than one produced by the generator. They use the combination of residual and discrimination losses as an anomaly score, such that a large score means an anomalous image. They extended their approach by presenting a faster anomaly detection algorithm (f-AnoGAN schlegl2019f) that used an improved WGAN architecture and sped up the mapping of input images to the latent space. Beggel et al. Beggel2019Robust considered identifying anomalies in images when the training set is contaminated with a small fraction of outliers. They trained an adversarial autoencoder that imposed a prior distribution on the latent representation by placing anomalies in the low-likelihood region. This architecture helped in identifying potential anomalies and in robust detection in the presence of outliers during training. Pidhorskyi et al. NIPS2018_7915 presented a probabilistic approach to adversarial training of autoencoders for anomaly detection by estimating the likelihood of a sample being generated by the inlier distribution. This was achieved by linearizing the parameterized manifold capturing the underlying structure of the inlier distribution and by improved autoencoder training. They report improved results on several publicly available image datasets.

Eide eide2018applying applied generative adversarial learning to find anomalies in hyper-spectral remote sensing images. Their generator is based on ResNet, which maps a low-dimensional input to a higher-dimensional image; thus, it works as a convolutional decoder. The discriminator has a similar design but works in the opposite direction. They modify the reconstruction cost of the generator by adding a term for the norm of the generated input. The modified reconstruction cost penalizes reconstructions from unlikely inputs more heavily. However, adding this term is not found to be helpful as the generator is unable to reconstruct anomalies even without any penalty term. Yarlagadda et al. yarlagadda2018satellite present the use of adversarial autoencoder learning for satellite image forgery detection and localization. The generator in their structure is an autoencoder and the discriminator is a CNN. The adversarially trained autoencoder encodes the image patches into low-dimensional features, which are then used to train a one-class SVM to detect forged patches. Lawson et al. lawson2017finding present the use of an adversarially trained deep convolutional autoencoder for finding anomalies in an autonomous robot's patrol view. Their method first learns the model for a normal scene from the autoencoder-based generator and then uses the features learned to find anomalies in the environment. More specifically, they compare the difference between the bottleneck features extracted from real images and from reconstructed images and use it as a measure for finding anomalies.

2.2 Adversarial Anomaly Detection in Videos

Sabokrou et al. Sabokrou2018Adversarially present an end-to-end OCC method that uses adversarial learning. The generator of their network is a convolutional autoencoder, which reconstructs the input with added noise. The discriminator is a typical CNN that takes reconstructed and real input and gives a likelihood estimate of the target score. After the adversarial training, the discriminator can be used to detect anomalies. They also show that applying the discriminator on the reconstructed images can provide better separation and hence better performance. Their results on MNIST, Caltech-256 and UCSD Ped2 datasets show the viability of learning one-class classifiers in an adversarial manner. Lee et al. lee2018stan present a spatio-temporal adversarial learning framework for anomaly detection in videos. Their framework consists of a spatio-temporal generator and discriminator. The network operates on a sequence of $N+1$ video frames. The generator takes as input the first and last $\frac{N}{2}$ frames and then generates the missing $(\frac{N}{2}+1)^{th}$ frame. This middle frame is generated by a bi-directional convolutional LSTM network. The discriminator consists of a 3DCNN that takes a sequence of $N+1$ frames as input, of which one frame is generated and the rest are original. The discriminator then tries to recognize this sequence as fake, while the generator must improve the generated middle frame in order to fool the discriminator. A potential issue with such an approach is that the discriminator is given the very difficult task of detecting only one frame in a sequence; conversely, the generator is given an easy task. Vu et al. vu2019robust presented a multi-level representation of intensity and motion in videos to identify anomalies. Their framework consisted of a de-noising autoencoder, a conditional generative adversarial network and an anomalous region detector at each representation level. Besides showing improved results on the UCSD Ped 1, UCSD Ped 2 and Avenue video anomaly datasets, their model was able to detect mislabeled anomalies in the UCSD Ped 1 dataset. Li and Chang li2019video presented an approach to train a Multivariate Gaussian Fully Convolution Adversarial Autoencoder to map the latent space representations of normal samples. A deep CNN was employed for the encoder of the deep network, and an energy-based method was then applied to obtain an anomaly score. The appearance and motion representations were combined to obtain robust anomaly detection results on three public datasets. Liu et al. liu2018future used the difference between future frame prediction and ground truth as a factor to detect anomalies in videos. Their objective function combines different losses, including appearance (intensity loss and gradient loss), motion (optical flow loss) and adversarial loss. They adopted a U-net ronneberger2015u as a generator and a Markovian discriminator isola2017image in their framework. Li et al. li2019spatio presented a U-net based frame prediction method that uses normal events in videos and detects abnormality using prediction error. They considered different types of losses in their objective function, including intensity, gradient, motion, RGB gradient and a mean square error loss during adversarial training. Nguyen and Meunier nguyen2019anomaly designed a video anomaly detection framework that combines a Convolution Autoencoder and a U-Net that is integrated with an Inception module, leading to a patch-based frame-level anomaly score. They trained this network using distance-based loss, optical flow loss and adversarial loss. Ravanbakhsh et al. ravanbakhsh2017abnormal present the use of adversarial learning for anomaly detection in crowded scenes. They train two conditional U-nets isola2017image, one for generating optical flow from frames and the other for generating frames from optical flow, using image and noise vector as inputs. The conditional discriminator takes either of the generated images and compares it against the real image to produce a probability that both of its input images come from the real data. However, this method may not work well with occluded scenes and it may be difficult to estimate the optical flow map. Tang et al. tang2020integrating combined the future frame prediction and reconstruction error of two U-nets connected in a pipeline with a pixel-level discriminator to detect anomalies in videos. They used intensity, gradient and temporal image difference losses and trained the network in an adversarial manner. They showed better results on several public datasets in comparison to the baseline presented by Liu et al. liu2018future. Zhou et al. zhou2019attention used an attention-based loss in an adversarial learning setting to alleviate the foreground-background imbalance problem in anomaly detection in videos. They considered a U-net based generator and a patch discriminator. Their results on public video datasets showed improvement in comparison to the baseline of Liu et al. liu2018future. Some other variants of 3D GANs have also been proposed for other applications. Wang et al. wang2017shape combine a 3D GAN with Recurrent Convolutional Networks for shape inpainting, and Zhang et al. zhang2019adversarial present a 3D GAN for video deblurring.

The spatio-temporal adversarial learning method to detect unseen falls presented in this paper extends the work of Sabokrou et al. Sabokrou2018Adversarially from a single image to a sequence of images (video) by learning spatio-temporal features. The proposed framework also differs from the work of Lee et al. lee2018stan in that it uses a 3DCAE instead of the bi-directional convolutional LSTM or 2D CAE. The work of Nogas et al. nogas2019deepfall suggests that training LSTM-based autoencoders can be slower in comparison to 3DCAE. Our 3DCAE reconstructs the whole sequence of frames given an input sequence of frames, instead of producing only one frame, and this reconstruction is fed to the 3DCNN discriminator. This way the discriminator is presented with a fully reconstructed sequence of frames, rather than one frame in a sequence, to decide if it is real or reconstructed. In the next section, we describe the various components of the proposed spatio-temporal adversarial framework for detecting unseen falls.

3 Spatio-Temporal Adversarial Learning

The proposed spatio-temporal adversarial learning framework for identifying unseen falls consists of: (i) training a 3DCAE to reconstruct a sequence of normal ADL video frames and, (ii) a 3DCNN to discriminate the reconstructed sequences from the original sequences of video frames of normal ADL. Both of these components perform 3D convolution operations. A 3D convolutional layer is defined as follows: the value $\boldsymbol{v}$ at position $(x,y,z)$ of the $j^{th}$ feature map in the $i^{th}$ 3D convolution layer, with bias $b_{ij}$, is given by the equation ji20133d

$$v_{ij}^{xyz} = f\left(\sum_{m}\sum_{p=0}^{P_{i}-1}\sum_{q=0}^{Q_{i}-1}\sum_{s=0}^{S_{i}-1} w_{ijm}^{pqs}\, v_{(i-1)m}^{(x+p)(y+q)(z+s)} + b_{ij}\right) \tag{1}$$

where $P_{i}$ , $Q_{i}$ , $S_{i}$ are the vertical (spatial), horizontal (spatial), and temporal extent of the filter cube $\boldsymbol{w}_{i}$ in the $i^{th}$ layer. The set of feature maps from the $(i-1)^{th}$ layer are indexed by $m$ , and $w_{ijm}^{pqs}$ is the value of the filter cube at position $pqs$ connected to the $m^{th}$ feature map in the previous layer. Multiple filter cubes will output multiple feature maps. Next, we describe the 3DCAE and 3DCNN that are used in the proposed adversarial learning framework to detect unseen falls.
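As an illustration of the 3D convolution just defined, the following is a minimal NumPy sketch that evaluates one output feature map directly from the triple sum over the filter cube. It assumes no padding and a stride of 1 (the layers described later use padding and larger strides), and the function names `conv3d_value` and `conv3d_feature_map` are hypothetical, not taken from the paper.

```python
import numpy as np

def relu(a):
    """ReLU activation, used here as the function f."""
    return max(a, 0.0)

def conv3d_value(prev_maps, filter_cube, bias, x, y, z, f=relu):
    """Value of one output feature map at position (x, y, z).

    prev_maps:   array (M, X, Y, Z), the feature maps of layer i-1, indexed by m
    filter_cube: array (M, P, Q, S), the weights w_{ijm}^{pqs} for output map j
    bias:        scalar b_{ij}
    """
    _, P, Q, S = filter_cube.shape
    # Slice the (M, P, Q, S) neighbourhood starting at (x, y, z).
    patch = prev_maps[:, x:x + P, y:y + Q, z:z + S]
    # Sum over m, p, q, s, add the bias, then apply the activation.
    return f(float(np.sum(filter_cube * patch)) + bias)

def conv3d_feature_map(prev_maps, filter_cube, bias, f=relu):
    """Full output feature map (assumption: no padding, stride 1)."""
    _, X, Y, Z = prev_maps.shape
    _, P, Q, S = filter_cube.shape
    out = np.empty((X - P + 1, Y - Q + 1, Z - S + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            for z in range(out.shape[2]):
                out[x, y, z] = conv3d_value(prev_maps, filter_cube, bias, x, y, z, f)
    return out
```

Applying several filter cubes to the same `prev_maps` yields several feature maps, one per cube. A deep learning library would compute the same quantity far more efficiently; the explicit loops are kept only to mirror the indices of the definition.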

3.1 3DCAE

The spatio-temporal autoencoder used in this paper, 3DCAE, is derived from the work of Nogas et al. nogas2019deepfall. The specification of the convolution filters, number of layers and depth of the network have been reported to work well for the fall detection problem from videos. We used the same baseline and have not added confounding parameters that would make the model more complex. The input to 3DCAE, $\boldsymbol{I}$, comprises a continuous sequence of $t=1,\ldots,T$ frames, called a window. These windows of length $T=8$ are generated by applying a temporal sliding window to input video frames, with padding (or not) and stride (the number of frames shifted from one window to the next). The input $\boldsymbol{I}$ is encoded by a sequence of 3D convolution layers. The first 3D convolution layer uses 3D convolutions with a stride of $1\times 2\times 2$ and padding, and the rest use a stride of $2\times 2\times 2$ and padding. This means that each dimension (temporal depth, height, and width) is reduced by a factor of $2$ with every 3D convolution layer except the first, which reduces only the spatial dimensions, thus allowing for a deeper architecture without collapsing the temporal dimension completely. Decoding operates as encoding but in reverse, using 3D deconvolution layers. The final deconvolution layer combines feature maps into the decoded reconstruction. This final layer uses a stride of $1\times 1\times 1$ and padding. For hidden layers, the activation function $f$ is set to ReLU. We use $P_{i}=Q_{i}=3$ and $S_{i}=5$ for all convolutional and deconvolutional layers, as these values were found to produce the best results across all the datasets. Table 1 shows the configuration of the 3DCAE used in our spatio-temporal adversarial framework. The output of the 3DCAE (reconstructed video sequence of size $T$) is fed to the 3D discriminator along with the actual input video sequence of size $T$. Batch normalization is used in all the layers of the 3DCAE except for the final layer.
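The temporal sliding window that produces the 3DCAE inputs can be sketched as follows. This is a minimal NumPy illustration of the unpadded case only; the function name `make_windows` and the `(frames, height, width, channels)` array layout are assumptions chosen to match the input shape listed in Table 1.

```python
import numpy as np

def make_windows(frames, T=8, stride=1):
    """Split a video into overlapping windows of T consecutive frames.

    frames: array (N, H, W, C) holding the N frames of one video
    returns: array (num_windows, T, H, W, C)
    Assumption: no temporal padding, so frames that do not fill a
    complete window at the end of the video are dropped.
    """
    n = frames.shape[0]
    if n < T:
        raise ValueError("video is shorter than one window")
    starts = range(0, n - T + 1, stride)
    return np.stack([frames[s:s + T] for s in starts])
```

With a stride of 1, a video of 20 frames gives 13 windows of shape `(8, 64, 64, 1)`, and most frames appear in 8 different windows, which is what the frame-level scores in Section 4 rely on.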

3.2 3D Discriminator

The discriminator in our setting is a 3DCNN, whose architecture is kept the same as the encoding configuration of the 3DCAE, followed by a fully connected layer of one neuron with a sigmoid function to output the probability that a sequence of frames is original rather than reconstructed. Batch normalization is used in all the layers of the 3D discriminator except for the input layer. LeakyReLU activation is used in all hidden layers, with the negative slope coefficient set to $0.2$.

It is to be noted that during the training phase, only the video sequences of normal ADL are presented to the 3DCAE and 3DCNN, whereas during testing phase video sequences may contain both normal ADL and fall frames.

3.3 Adversarial Learning

As discussed previously, the proposed adversarial framework consists of two components: a 3DCAE as a generator and a 3DCNN as a discriminator. Figure 1 shows the setup of the overall adversarial framework, where the autoencoder and discriminator are trained in an adversarial setting. The 3DCAE (represented as $\mathcal{R}$) takes the input sequence ($\boldsymbol{I}$) of window size $T$ of normal ADL and reconstructs the sequence, $\boldsymbol{O}$, which is then fed to the 3DCNN (represented as $\mathcal{D}$) in an attempt to fool it into treating the reconstruction as an original input. However, $\mathcal{D}$ has access to the original input sequence ($\boldsymbol{I}$) and may easily identify the reconstructed sequence as not being the original input sequence. The two components thus play an adversarial game, which after completion of training should enable $\mathcal{R}$ to reconstruct input video sequences with minimum reconstruction error so as to successfully fool $\mathcal{D}$. This means that $\mathcal{R}$ should be able to reconstruct an output sequence very similar to the input sequence. In other words, the spatio-temporal autoencoder would have learned the concept of normal ADL after successful completion of the training. This further implies that any sequence with an anomaly (e.g. a fall) would be reconstructed with high reconstruction error. At the same time, the discriminator would have become an expert at distinguishing badly reconstructed sequences from input sequences.

In our setting, $\mathcal{R}$ maps $\boldsymbol{I}$ to $\boldsymbol{O}$ using the distribution of the target class $p$ , i.e.

$$\boldsymbol{O} = \boldsymbol{I} \sim p \tag{2}$$

However, $\mathcal{D}$ has access to input samples and is exposed to $p$ . Therefore, $\mathcal{D}$ can explicitly decide if $\mathcal{R(O)}$ comes from $p$ or not. The objective function to jointly learn $\mathcal{R}$ and $\mathcal{D}$ can be written as:

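$$\min_{\mathcal{R}} \max_{\mathcal{D}} \Big( \mathbb{E}_{\boldsymbol{I} \sim p}[\log(\mathcal{D}(\boldsymbol{I}))] + \mathbb{E}_{\boldsymbol{O} \sim p}[\log(1 - \mathcal{D}(\mathcal{R}(\boldsymbol{O})))] \Big) \tag{3}$$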

To train the model, we need to calculate the (i) loss due to the 3DCAE ( $\mathcal{L_{R}}$ ), and (ii) loss due to both 3DCAE and the 3DCNN ( $\mathcal{L_{R+D}}$ ). The 3DCAE loss is simply the reconstruction error between the $j^{th}$ frame of $I_{i}$ and $O_{i}$ , and can be written as

$$\mathcal{L_{R}} = \| I_{i,j} - O_{i,j} \|_{2}^{2} \tag{4}$$

Thus, the total loss function to minimize can be written as:

$$\mathcal{L} = \mathcal{L_{R}} + \lambda \mathcal{L_{R+D}} \tag{5}$$

where $\lambda$ is a positive number that controls the relative importance of both the loss terms.
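To make the weighting of the two loss terms concrete, here is a minimal NumPy sketch of the total loss for one batch of windows. The text does not spell out the form of $\mathcal{L_{R+D}}$, so the sketch assumes a common choice: the cross-entropy term that rewards the 3DCAE when the discriminator assigns its reconstructions a high probability of being original. Averaging the per-frame squared errors over the batch is also an assumption, and all function names are hypothetical.

```python
import numpy as np

def reconstruction_loss(I, O):
    """L_R: squared L2 error per frame, averaged over the batch.

    I, O: arrays (batch, T, H, W, C) of input windows and reconstructions.
    """
    per_frame = np.sum((I - O) ** 2, axis=(2, 3, 4))  # shape (batch, T)
    return float(per_frame.mean())

def adversarial_loss(d_on_reconstruction, eps=1e-7):
    """Assumed form of L_{R+D}: -log D(R(I)), averaged over the batch.

    d_on_reconstruction: discriminator probabilities for the reconstructions.
    """
    p = np.clip(np.asarray(d_on_reconstruction, dtype=float), eps, 1.0 - eps)
    return float(-np.mean(np.log(p)))

def total_loss(I, O, d_on_reconstruction, lam):
    """L = L_R + lambda * L_{R+D}, with lambda > 0 weighting the second term."""
    return reconstruction_loss(I, O) + lam * adversarial_loss(d_on_reconstruction)
```

In an actual training loop these quantities would be computed inside an automatic differentiation framework so that gradients can flow to the 3DCAE; the NumPy version only shows how the two terms are combined.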

For comparison purposes, we implement two other variants of autoencoders to detect unseen falls that are trained as per the proposed adversarial framework. The first variant, which we call DAE-AN, uses a deep autoencoder as a generator and a multi-layer feed-forward network as the discriminator. The configuration of the discriminator is the same as the encoder of the deep autoencoder. This method will learn global features from the video sequences to successfully reconstruct ADL. The second variant, which we call CAE-AN, uses a convolutional autoencoder (CAE) as a generator and a convolutional feed-forward network as a discriminator. The configuration of the discriminator, in this case, is the same as the encoder of the CAE (this framework is analogous to the work of Sabokrou et al. Sabokrou2018Adversarially). This method will learn localized spatial features. The structures of the encoder and decoder for DAE-AN and CAE-AN are shown in Tables 2 and 3. It is to be noted that the input to the DAE-AN and CAE-AN is a frame from the video, whereas the input to the proposed spatio-temporal adversarial learning method is a window consisting of a sequence of $T$ frames, as shown in Figure 1. Therefore, the proposed method will learn both spatial and temporal features when the training is successfully completed.

4 Detecting Unseen Falls

The spatio-temporal framework is trained in an adversarial manner on only normal ADL, and an unseen fall is detected as an anomaly during testing. The method to detect unseen falls is shown in Figure 2 (derived from nogas2019deepfall). All the frames in the video, $Fr_{i}$, are broken down into windows of length $T=8$ with a stride of $1$. For the $i^{th}$ window $\boldsymbol{I}_{i}$, the 3DCAE outputs a reconstruction of this window, $\boldsymbol{O}_{i}$. The reconstruction error ($R_{i,j}$) between the $j^{th}$ frame of $I_{i}$ and $O_{i}$ can be calculated as (similar to Equation 4)

$$R_{i,j} = \| I_{i,j} - O_{i,j} \|_{2}^{2} \tag{6}$$

There are two ways to detect unseen falls, (i) at the frame level, or (ii) at the window level, which are described next.

Frame Level Anomaly: In the frame level anomaly method, the reconstruction error ($R_{i,j}$) (obtained from the 3DCAE) is computed for every frame $j$ across different windows. The average ($C_{\mu}^{j}$) and standard deviation ($C_{\sigma}^{j}$) of a frame $j$ across different windows are used as an anomaly score as follows nogas2019deepfall:

[TABLE]

$C_{\mu}^{j}$ and $C_{\sigma}^{j}$ give an anomaly score per-frame, while incorporating information from the past and future frames. A large value of $C_{\mu}^{j}$ or $C_{\sigma}^{j}$ means that the $j^{th}$ frame, when appearing at different positions in subsequent windows, is reconstructed with a high average error or with highly variable error; thus, indicating the occurrence of a fall. As this method calculates anomaly at the frame level, it is directly comparable with DAE-AN and CAE-AN. For DAE-AN and CAE-AN, the reconstruction error of an input frame is used as an anomaly score.
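The frame-level scoring described above can be sketched in NumPy as follows. The sketch groups the reconstruction errors of each video frame across all windows that contain it and takes their mean and standard deviation. It assumes windows are indexed from 0, that the error matrix stores the $k^{th}$ frame of window $i$ at `R[i, k]`, and that the population standard deviation is used; frames near the start and end of the video appear in fewer than $T$ windows and are averaged over the windows available. The function names are hypothetical.

```python
import numpy as np

def reconstruction_errors(I, O):
    """Per-frame squared L2 reconstruction error.

    I, O: arrays (num_windows, T, H, W, C); returns array (num_windows, T).
    """
    return np.sum((I - O) ** 2, axis=(2, 3, 4))

def frame_scores(R, stride=1):
    """Cross-window mean and standard deviation of the error of each frame.

    R: array (num_windows, T); R[i, k] is the error of the k-th frame of
       window i, which is video frame i * stride + k.
    returns: (c_mu, c_sigma), each of length equal to the number of frames.
    """
    num_windows, T = R.shape
    n_frames = (num_windows - 1) * stride + T
    per_frame = [[] for _ in range(n_frames)]
    for i in range(num_windows):
        for k in range(T):
            per_frame[i * stride + k].append(R[i, k])
    c_mu = np.array([np.mean(v) for v in per_frame])
    c_sigma = np.array([np.std(v) for v in per_frame])
    return c_mu, c_sigma
```

Either returned array can then be thresholded, or swept over thresholds to compute an AUC, with one score per frame of the original video.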

Window Level Anomaly: In the window level anomaly method, a score is calculated for the entire window of $T$ frames. For an input $x$ comprising $T$ frames, this score can be any of the following:

(i) Reconstruction error of the 3DCAE, $R(x_{i,j})$ . For a particular window $i$ , the mean of the reconstruction error of all the $T$ frames ( $W_{\mu}^{i}$ ) and their standard deviation ( $W_{\sigma}^{i}$ ) are used as an anomaly score, as follows:

$$W_{\mu}^{i}=\frac{1}{T}\sum_{j=1}^{T}R_{i,j},\qquad W_{\sigma}^{i}=\sqrt{\frac{1}{T}\sum_{j=1}^{T}\left(R_{i,j}-W_{\mu}^{i}\right)^{2}}$$

(ii) Probability score of the discriminator 3DCNN, $\mathcal{D}(x)$ ,

(iii) Probability score of the discriminator on the reconstructed input, $\mathcal{D}(\mathcal{R}(x))$ Sabokrou2018Adversarially ,

Combination of both autoencoder and discriminator scores, i.e.

(iv) $\mathcal{D}(x)+\lambda\mathcal{R}(x)$ , and

(v) $\mathcal{D}(\mathcal{R}(x))+\lambda\mathcal{R}(x)$

The anomaly scores (iv) and (v) each have two versions, based on the mean and the standard deviation of the reconstruction error, represented as $W_{\mu}-\mathcal{D}(x)+\lambda\mathcal{R}(x)$ and $W_{\sigma}-\mathcal{D}(x)+\lambda\mathcal{R}(x)$ , and $W_{\mu}-\mathcal{D}(\mathcal{R}(x))+\lambda\mathcal{R}(x)$ and $W_{\sigma}-\mathcal{D}(\mathcal{R}(x))+\lambda\mathcal{R}(x)$ . The $-$ sign should not be confused with the minus sign; it only indicates whether a particular score is derived from the mean or the standard deviation of the reconstruction error.
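The window level scores listed above can be assembled as in the following sketch. The function name `window_level_scores` and the dictionary keys are hypothetical, and the discriminator outputs are assumed to be supplied as plain arrays by a trained 3DCNN that is not shown. The combined scores are computed literally as written in the text; the sketch makes no attempt to decide whether a high discriminator probability should count for or against an anomaly.

```python
import numpy as np

def window_level_scores(R, d_x, d_rx, lam=1.0):
    """Compute the window-level scores for every window.

    R    : (num_windows, T) per-frame reconstruction errors R[i, j]
    d_x  : (num_windows,) discriminator output on each input window, D(x)
    d_rx : (num_windows,) discriminator output on each reconstruction, D(R(x))
    lam  : weight lambda on the reconstruction term
    """
    R = np.asarray(R, dtype=float)
    d_x = np.asarray(d_x, dtype=float)
    d_rx = np.asarray(d_rx, dtype=float)
    w_mu = R.mean(axis=1)    # mean error over the T frames of each window
    w_sigma = R.std(axis=1)  # population standard deviation (assumption)
    return {
        "W_mu": w_mu,
        "W_sigma": w_sigma,
        "D(x)": d_x,
        "D(R(x))": d_rx,
        # The "-" in the key names is a separator, not a subtraction.
        "W_mu-D(x)+lam*R(x)": d_x + lam * w_mu,
        "W_sigma-D(x)+lam*R(x)": d_x + lam * w_sigma,
        "W_mu-D(R(x))+lam*R(x)": d_rx + lam * w_mu,
        "W_sigma-D(R(x))+lam*R(x)": d_rx + lam * w_sigma,
    }
```

Returning all eight scores in one dictionary makes it straightforward to evaluate each of them against the same window labels.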

The number of fall frames ( $\alpha$ ) that must be present in a window for the ground truth label of the entire window to be a fall is a hyperparameter of the method and influences the detection of anomalies. Labelling a window as a fall with a low value of $\alpha$ may result in a high false alarm rate, whereas requiring a high value of $\alpha$ may miss some falls. In the experiments, we varied $\alpha$ from $1$ to $T$ to understand the impact of its choice.
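To make the role of $\alpha$ concrete, the sketch below derives window ground truth labels from per-frame annotations. The function name `window_labels` is hypothetical, and it is an assumption that a window is labelled as a fall when it contains at least $\alpha$ fall frames; the text does not state whether the comparison is "at least" or "exactly".

```python
import numpy as np

def window_labels(frame_labels, T: int = 8, alpha: int = 1) -> np.ndarray:
    """Label each stride-1 window as fall (1) or ADL (0).

    frame_labels : sequence of 0/1 per-frame ground truth (1 = fall frame)
    alpha        : minimum number of fall frames for a window to be a fall
    """
    frame_labels = np.asarray(frame_labels, dtype=int)
    if not 1 <= alpha <= T:
        raise ValueError("alpha must lie between 1 and T")
    n = len(frame_labels)
    if n < T:
        raise ValueError("video is shorter than one window")
    # Count the fall frames inside every window of length T.
    counts = np.array([frame_labels[i:i + T].sum() for i in range(n - T + 1)])
    return (counts >= alpha).astype(int)
```

For example, with frame labels `[0, 0, 1, 1, 0]` and `T=3`, the windows contain 1, 2 and 2 fall frames, so `alpha=1` marks all three windows as falls while `alpha=3` marks none.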

5 Experiments and Results

5.1 Datasets

We use the following three datasets to test the proposed spatio-temporal adversarial framework for detecting unseen falls. All of these datasets contain videos captured with thermal or depth cameras. Therefore, the videos in these datasets partially or fully obfuscate the identity of the person.

Thermal Dataset: The Thermal dataset Thermal contains $9$ videos with normal ADL and $35$ videos containing falls and other normal activities. These videos were captured using a FLIR ONE thermal camera mounted on an Android phone with a spatial resolution of $640\times 480$ . The videos have a frame rate of either 25 fps or 15 fps, which was obtained by inspecting the properties of each video. The thermal camera can protect the privacy/identity of the individual and can also capture images at night. To create the sequence of windows given as input to the proposed spatio-temporal adversarial framework, a sliding window ( $T=8$ ) is applied to all video frames, resulting in $22,116$ frames from the $9$ ADL videos. A sample of normal ADL and fall activities from the Thermal dataset is shown in Figure 3.

UR Dataset: The UR dataset UR contains $40$ videos of a person performing normal ADL (such as walking, sitting down, crouching down, and lying down in bed) and $30$ videos with a fall in them. Two types of falls, from standing and from sitting on a chair, were performed by five persons. These videos were captured at $30$ fps using a Kinect depth sensor, which obfuscates the identity of the person. The depth map is stored in VGA resolution ( $640\times 480$ ). The UR dataset has many missing pixel regions, called ‘holes’, which were filled using a method based on depth colorization Silberman:ECCV12 . The version of this dataset obtained after filling the holes is called UR-filled in this paper. After applying the sliding window ( $T=8$ ), $8,661$ windows of contiguous ADL frames were obtained for training the spatio-temporal adversarial framework. Samples of normal ADL and falls from the UR and UR-filled datasets are shown in Figures 4 and 5.

SDU Fall Dataset: In the SDU Fall dataset SDU , ten young men and women performed six types of activities $30$ times each, resulting in $1800$ video clips. The data shared with us contained $1197$ videos, of which $997$ had normal ADL and $200$ had falls. The activities included falling, bending, squatting, sitting, lying, and walking. These videos were captured using a Kinect camera (thus hiding the person’s identity) at $30$ fps, with a video frame size of $320\times 240$ , and stored in AVI format. The SDU Fall dataset also has holes, similar to the UR dataset. However, the distance information of the depth frames is not provided with this dataset; therefore, we used an inpainting approach NS to fill these holes, and we call the resulting data SDU-filled. After applying the sliding window ( $T=8$ ), $163,573$ windows of contiguous frames were obtained. Samples of normal ADL and falls from the SDU and SDU-filled datasets are shown in Figures 6 and 7.

In all the datasets, some frames are empty with no person, some show a person entering from the left, right or front far end, and some show a full person in the scene. All the frames in all the datasets are resized to $64\times 64$ and normalized by dividing the pixel values by $255$ to keep them in the range $[0,1]$ ; the per-frame mean is then subtracted from each frame, which keeps the pixel values in the range $[-1,1]$ . The different adversarially trained methods are trained on only the normal ADL frames or their sequences. For testing, videos containing both normal ADL and unseen fall frames (or their sequences), which were manually annotated as ground truth, are presented to the trained network.

Since a fall is a short event, it takes only a few frames from start to end. In our datasets, the maximum number of frames for a fall to occur was $13$ . Since we wanted the number of frames to be a power of $2$ , we chose $T=8$ , as higher values of $T$ would not be possible. A smaller value, $T=4$ , resulted in more false alarms, and those results are not shown in the paper. In our implementation of the spatio-temporal adversarial learning, we use the SGD optimizer with a learning rate of $0.0002$ for the 3DCNN discriminator and the Adadelta optimizer for the 3DCAE. We trained our model with various values of $\lambda$ . Larger values of $\lambda$ led to the mode collapse problem; therefore, we chose $\lambda=1$ , which gave the best results. We train all the adversarial methods for a maximum of $500$ epochs.
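The frame normalization described above (scaling by $255$ followed by per-frame mean subtraction) can be sketched as follows. The function name `preprocess` is hypothetical, and the frames are assumed to be 8-bit grayscale images that have already been resized to $64\times 64$ by some image library; the resizing step is not shown.

```python
import numpy as np

def preprocess(frames_uint8: np.ndarray) -> np.ndarray:
    """Normalize a stack of grayscale frames of shape (N, H, W).

    Pixel values are scaled to [0, 1], then each frame's own mean is
    subtracted, which leaves every value inside [-1, 1].
    """
    if frames_uint8.ndim != 3:
        raise ValueError("expected an array of shape (N, H, W)")
    x = frames_uint8.astype(np.float32) / 255.0
    # Subtract the mean of each frame from that frame only.
    x -= x.mean(axis=(1, 2), keepdims=True)
    return x
```

Because each frame is centred on its own mean, a uniformly bright frame and a uniformly dark frame both map to all zeros, so only the intensity variation within a frame is kept.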

5.2 Results

Frame Level Anomaly: Table 4 shows the Area Under the Curve (AUC) values obtained by applying the frame level anomaly scoring method to DAE-AN, CAE-AN and the proposed spatio-temporal adversarial network (using the $C_{\mu}$ and $C_{\sigma}$ anomaly scores). The best AUC values are shown in gray cells. We observe that the proposed method performs better than DAE-AN and CAE-AN on all the datasets, except against DAE-AN on SDU-filled. The SDU dataset videos contain simple and organic activities, falls always happen from standing, and there is no furniture or other background objects in the scene. We hypothesize that, for these reasons, DAE-AN and CAE-AN may be able to learn global and spatial features that detect falls comparably to the spatio-temporal network. However, the activities in the Thermal and UR datasets are complex, falls happen in various poses (e.g. falling from a chair, falling from sitting and falling from standing), and the scene involves different objects in the background (e.g. bed, chair). In the Thermal dataset, the pixel intensities change when a person enters the scene because of the change in heat in the environment. The proposed spatio-temporal adversarial learning method worked well under these diverse conditions to detect unseen falls. We also observe that all the fall detection methods performed worse on the original UR and SDU datasets than on their hole-filled versions. This shows that videos with holes are detrimental to learning normal ADL and identifying unseen falls. We further observe that the AUC results of the proposed approach are slightly better with $C_{\sigma}$ than with $C_{\mu}$ for all the datasets.
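For readers who wish to reproduce this kind of evaluation, the sketch below computes an AUC from anomaly scores and ground truth labels. It assumes that the AUC refers to the area under the ROC curve, which equals the fraction of (fall, ADL) pairs in which the fall receives the higher score, with ties counted as one half. The function name `roc_auc` is hypothetical, and the pairwise comparison is chosen for clarity rather than efficiency.

```python
import numpy as np

def roc_auc(scores, labels) -> float:
    """Area under the ROC curve for anomaly scores.

    scores : higher values mean "more anomalous"
    labels : 1 for fall, 0 for normal ADL
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    if len(pos) == 0 or len(neg) == 0:
        raise ValueError("both classes are needed to compute the AUC")
    # Compare every fall score with every ADL score.
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return float((greater + 0.5 * ties) / (len(pos) * len(neg)))
```

The pairwise matrix grows with the product of the two class sizes, so for long videos a rank-based implementation from a statistics library would be the more practical choice.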

Window Level Anomaly: Figures 8, 9 and 10 show the AUC values of the spatio-temporal adversarial learning for detecting unseen falls on the Thermal, UR-filled and SDU-filled datasets using window level anomaly scores, for different choices of $\alpha$ from $1$ to $8$ (which is the maximum size of the window). The results on UR and SDU with holes were consistently worse and are not shown. We observe that, for the different anomaly scores on each of the datasets, the AUC initially increases with the number of fall frames in a window (i.e. $\alpha$ ) and then stabilizes for higher values of $\alpha$ . This is because if a window is labelled as a ‘fall’ based on very few fall frames, many false alarms result, lowering the AUC. The anomaly score $D(R(x))$ clearly performs worst on all the datasets. Furthermore, on the UR-filled dataset, the two other poorly performing scores are $W_{\sigma}-D(R(x))+R(x)$ and $D(x)$ , and on the SDU-filled dataset they are $D(x)$ and $W_{\sigma}-D(x)+R(x)$ . The other anomaly scores perform comparably to each other. This experiment suggests that unseen falls can be detected with high AUC using window level anomaly scoring. However, the discriminator scores, whether used alone or combined with the reconstruction error, may not be good candidates for detecting unseen falls.

Note that the scores of the window level anomaly scoring method are not directly comparable with those of the frame level scoring method. In the frame level method, the anomaly score is calculated for every frame (occurring in different windows), whereas in the window level method, we designate the class of the whole window instead of deciding the class of every frame across windows. Another factor in the window level method is the number of fall frames ( $\alpha$ ) that must be present in a window for the ground truth label of the entire window to be a fall. This parameter is varied, and the results are shown in Figures 8, 9 and 10. Therefore, these two types of anomaly scoring methods are not directly comparable, and their results are discussed separately in the paper.

The proposed framework may detect as falls other abnormal activities that deviate significantly from normal ADL, such as syncope, tripping, or the presence of new objects or people in the scene. However, those variations were not present in the datasets we tested.

6 Conclusions and Future Work

This paper deals with identifying unseen falls in videos using a new spatio-temporal adversarial learning framework. The videos used in this paper are privacy preserving, as they come from thermal and depth cameras that can partially or fully obfuscate the facial features of a person. This further supports the idea that, for the fall detection problem, only the spatial and temporal information contained in the video is needed and not the identity revealing information (e.g. the face of the person). We present a learning strategy to train the adversarial framework using a spatio-temporal autoencoder and a spatio-temporal discriminator. The results on three public datasets suggest high performance in comparison to two other spatial adversarial baselines. Encouraged by the results presented in this paper, we are currently collecting a new fall detection dataset using multiple types of vision sensing modalities, such as thermal cameras, depth cameras, an IP camera and an RGB camera. These ceiling mounted cameras represent a more realistic scenario for use in a home setting. This unique dataset will be made public and will help us compare different sensing modalities for the problem of fall detection. Furthermore, in the future, we will use spatio-temporal residual / U-net networks with attention in an adversarial framework to detect unseen falls and other health related abnormal behaviours.

Bibliography

1. Akcay, S., Atapour-Abarghouei, A., Breckon, T.P.: Ganomaly: Semi-supervised anomaly detection via adversarial training. In: Asian Conference on Computer Vision, pp. 622–637. Springer (2018)
2. Beggel, L., Pfeiffer, M., Bischl, B.: Robust anomaly detection in images using adversarial autoencoders. In: The European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases. ECML-PKDD (2019)
3. Bertalmio, M., Bertozzi, A.L., Sapiro, G.: Navier-stokes, fluid dynamics, and image and video inpainting. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. CVPR 2001, vol. 1, pp. I–355–I–362 (2001). DOI 10.1109/CVPR.2001.990497
4. Kwolek, B., Kepski, M.: Human fall detection on embedded platform using depth maps and wireless accelerometer. Computer Methods and Programs in Biomedicine 117, 489–501 (2014)
5. Eide, A.W.W.: Applying generative adversarial networks for anomaly detection in hyperspectral remote sensing imagery. Master’s thesis, NTNU (2018)
6. Isola, P., Zhu, J.Y., Zhou, T., Efros, A.A.: Image-to-image translation with conditional adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1125–1134 (2017)
7. Ji, S., Xu, W., Yang, M., Yu, K.: 3d convolutional neural networks for human action recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(1), 221–231 (2013)
8. Khan, S.S., Hoey, J.: Review of fall detection techniques: A data availability perspective. Medical Engineering & Physics 39, 12–22 (2017)