Asymmetric Residual Neural Network for Accurate Human Activity Recognition

Jun Long; WuQing Sun; Zhan Yang; Osolo Ian Raymond

arXiv:1903.05359·cs.CV·June 12, 2019

TL;DR

This paper introduces ARN, an asymmetric residual neural network designed to improve human activity recognition accuracy by capturing spatial and temporal features through dual-path frameworks, validated on benchmark datasets.

Contribution

The paper proposes a novel asymmetric residual network architecture that effectively captures spatial and temporal features for HAR, demonstrating improved accuracy over existing methods.

Findings

01

ARN outperforms conventional methods on benchmark datasets.

02

Lightweight long time window path maintains high recognition accuracy.

03

Network parameter tuning influences model performance significantly.

Abstract

Human Activity Recognition (HAR) using deep neural networks has become a hot topic in human-computer interaction. Machines can effectively identify naturalistic human activities by learning from a large collection of sensor data. Activity recognition is not only an interesting research problem, but also has many real-world practical applications. Based on the success of residual networks in automatically learning high-level representations, we propose a novel Asymmetric Residual Network, named ARN. ARN is implemented using two identical path frameworks consisting of (1) a short time window, which is used to capture spatial features, and (2) a long time window, which is used to capture fine temporal features. The long time window path can be made very lightweight by reducing its channel capacity, while still being able to learn useful…

Tables

Table 1: Comparison of different HAR technologies

Method	manual f.¹	high-level f.¹	spatial f.¹	temporal f.¹	unsupervised	supervised
HC [21]	✓	$\times$	$\times$	$\times$	✓	$\times$
CBH [22]	✓	$\times$	$\times$	$\times$	✓	$\times$
CBS [23]	✓	$\times$	$\times$	$\times$	✓	$\times$
AE [24]	✓	$\times$	$\times$	$\times$	✓	$\times$
MLP [25]	✓	$\times$	$\times$	$\times$	✓	$\times$
CNN [14]	$\times$	✓	$\times$	$\times$	✓	✓
LSTM [26]	$\times$	✓	$\times$	$\times$	$\times$	✓
Hybrid [27]	✓	✓	$\times$	$\times$	$\times$	✓
ResNet [20]	✓	✓	$\times$	$\times$	✓	✓
ARN(This Work)	✓	✓	✓	✓	✓	✓

Table 2: The layer parameters of the ARN model.

Stage	narrow path	wide path
Conv1	$64 \times (5, 1)$	$64 \times (5, 1)$
Max-pooling	/2	/2
res1	$[\begin{matrix} 1 \times 1 & 64 \\ 3 \times 1 & 64 \\ 1 \times 1 & 256 \end{matrix}] \times 3$	$[\begin{matrix} 1 \times 1 & 64 \\ 3 \times 1 & 64 \\ 1 \times 1 & 256 \end{matrix}] \times 3$
res2	$[\begin{matrix} 1 \times 1 & 512 \\ 3 \times 1 & 512 \\ 1 \times 1 & 2048 \end{matrix}] \times 4$	$[\begin{matrix} 1 \times 1 & 512 \\ 3 \times 1 & 512 \\ 1 \times 1 & 2048 \end{matrix}] \times 4$
res3	$[\begin{matrix} 1 \times 1 & 256 \\ 3 \times 1 & 256 \\ 1 \times 1 & 1024 \end{matrix}] \times 6$	$[\begin{matrix} 1 \times 1 & 256 \\ 3 \times 1 & 256 \\ 1 \times 1 & 1024 \end{matrix}] \times 6$
res4	$[\begin{matrix} 1 \times 1 & 128 \\ 3 \times 1 & 128 \\ 1 \times 1 & 512 \end{matrix}] \times 3$	$[\begin{matrix} 1 \times 1 & 128 \\ 3 \times 1 & 128 \\ 1 \times 1 & 512 \end{matrix}] \times 3$
concate	global average pool	global average pool
fc.	512	512

Table 3: The details of experimental datasets

Datasets	OPPORTUNITY	UniMiB-SHAR
Types of sensors	Custom Bluetooth wireless accelerometers, gyroscopes, Sun SPOTs and InertiaCube3, Ubisense localisation system, a custom-made magnetic field sensor	A Bosch BMA220 acceleration sensor
Numbers of sensors	72	3
Numbers of samples	473K	11771
Acquisition periods	10-20 (min)	0.6 (min)

Table 4: Classes and proportions of the OPPORTUNITY dataset

Class	Proportion	Class	Proportion
Open Door 1/2	1.87%/1.26%	Open Fridge	1.60%
Close Door 1/2	6.15%/1.54%	Close Fridge	0.79%
Open Dishwasher	1.85%	Close Dishwasher	1.32%
Open Drawer 1/2/3	1.09%/1.64%/0.94%	Clean Table	1.23%
Close Drawer 1/2/3	0.87%/1.69%/2.04%	Drink from Cup	1.07%
Toggle Switch	0.78%	NULL	72.28%

Table 5: Classes and proportions of the UniMiB-SHAR dataset

Class	Proportion	Class	Proportion
StandingUpfromSitting	1.30%	Walking	14.77%
StandingUpfromLaying	1.83%	Running	16.86%
LyingDownfromStanding	2.51%	Going Up	7.82%
Jumping	6.34%	Going Down	11.25%
F(alling) Forward	4.49%	F and Hitting Obstacle	5.62%
F Backward	4.47%	Syncope	4.36%
F Right	4.34%	F with ProStrategies	4.11%
F Backward SittingChair	3.69%	F Left	4.54%
Sitting Down	1.70%

Table 6: Hyper-parameters of the learning-based methods on the OPPORTUNITY dataset and UniMiB-SHAR dataset.

Model	parameters
AE	5000^a
AE	5000^a
MLP	2000^a
	2000^a
	2000^a
CNN	((11,1),(1,1),50,(2,1))^b
	((10,1),(1,1),40,(3,1))^b
	((6,1),(1,1),30,(1,1))^b
	1000^a
LSTM	(64,600)^c
	(64,600)^c
	512^a
Hybrid	((11,1),(1,1),50,(2,1))^b
	(27,600)^c
	(27,600)^c
	512^a

Table 7: Weighted $F_{1}$-score performances of different methods on the OPPORTUNITY and UniMiB-SHAR datasets. (n) and (w) denote the narrow and wide path, respectively.

Method	$T$ (time window)	OPPORTUNITY	UniMiB-SHAR
HC [21]	32	84.95	22.83
	64	85.56	22.19
	96	85.69	21.96
CBH [22]	32	84.37	64.51
	64	85.21	65.03
	96	84.66	64.36
CBS [23]	32	85.53	67.54
	64	86.01	67.97
	96	85.39	67.36
AE [24]	32	82.87	68.37
	64	84.54	68.24
	96	83.39	68.39
MLP [25]	32	87.32	73.33
	64	87.34	75.36
	96	86.65	74.82
CNN [14]	32	87.51	74.01
	64	88.03	73.04
	96	87.62	73.36
LSTM [26]	32	85.33	69.24
	64	86.89	69.49
	96	86.21	68.81
Hybrid [27]	32	87.91	73.19
	64	88.17	73.22
	96	87.67	72.26
ResNet [20]	32	88.91	76.19
	64	89.17	76.22
	96	87.67	75.26
ARN	32-96 (n)-(w)	90.29	76.39

Table 8: Weighted $F_{1}$-score performance comparison of ARN with combinations of different sliding-window lengths on the OPPORTUNITY and UniMiB-SHAR datasets. (n) and (w) denote the narrow and wide path, respectively.

Method	$T$ (n)-(w) (time window)	OPPORTUNITY	UniMiB-SHAR
ARN_1	32-64	90.21	77.23
ARN_2	32-96	90.29	76.39
ARN_3	64-96	90.19	76.04

Equations

$x_{j}^{l+1} = \sigma\left(\sum_{i \in Maps} x_{i}^{l} \otimes K_{ij}^{l} + b_{j}^{l}\right)$

$p(a^{(i)} = j) = \frac{e^{a_{j}^{(i)}}}{\sum_{k=1}^{K} e^{a_{k}^{(i)}}}$

$\ell = -\frac{1}{N}\left(\sum_{i=1}^{N}\sum_{j=1}^{K} \mathbf{1}\{y^{(i)} = j\} \cdot \log p(a^{(i)} = j)\right)$

$f_{t} = \sigma_{f}(W_{f} x_{t} + U_{f} h_{t-1} + b_{f})$

$i_{t} = \sigma_{i}(W_{i} x_{t} + U_{i} h_{t-1} + b_{i})$

$o_{t} = \sigma_{o}(W_{o} x_{t} + U_{o} h_{t-1} + b_{o})$

$\tilde{c}_{t} = \sigma_{c}(W_{c} x_{t} + U_{c} h_{t-1} + b_{c})$

$c_{t} = f_{t} \cdot c_{t-1} + i_{t} \cdot \tilde{c}_{t}$

$h_{t} = o_{t} \cdot \sigma_{h}(c_{t})$

$y_{l} = h(x_{l}) + F(x_{l}, W_{l})$

$x_{l+1} = f(y_{l})$

$p(k \mid x) = \frac{\exp(z_{k})}{\sum_{i=1}^{\mathcal{K}} \exp(z_{i})}$

$\ell = -\sum_{k=1}^{\mathcal{K}} \log(p(k \mid x))\, q(k \mid x)$

$F_{w} = 2\sum_{G} w_{g} \frac{p_{g} \cdot r_{g}}{p_{g} + r_{g}},$

Full text

Asymmetric Residual Neural Network for Accurate Human Activity Recognition

Jun Long1, WuQing Sun1, Osolo Ian Raymond1, Zhan Yang1,2

1School of Information Science and Engineering, Central South University, Changsha 410083, China

2Network Resources Management and Trust Evaluation Key Laboratory of Hunan Province

Accepted by Information, DOI:10.3390/info10060203

I. Introduction

Human Activity Recognition (HAR) is very important for human-computer interaction and is an indispensable part of many current real-world applications. To make human-computer interaction activity-aware, the latent features in on-body device data must be learned. HAR using wearable-device data is at the core of intelligent assistive technology because of its many applications in smart homes [1], intelligent traffic control [2], medical/health assistance [3, 4], skill-based assessment [5], and even the security field [6]. In particular, for elderly people who live in remote areas and need continuous monitoring, HAR can greatly increase their safety [7].

Nowadays, accelerometers, gyroscopes and magnetic field sensors are widely used in smartphones (e.g., Apple iPhone, Samsung Galaxy, Huawei P/Mate) and smart bracelets (e.g., Apple Watch, Fitbit). With the increasing number of wearable sensors and Internet of Things (IoT) devices, there is a growing trend of collecting users' activity data in real time. The key components of HAR include a sliding time window over time-series data captured with on-body sensors, manually designed feature extraction procedures, and a wide variety of supervised learning methods.

In the past years, researchers have made considerable progress in wearable activity recognition, using algorithms such as Logistic Regression [40], Decision Trees [8], and Hidden Markov Models [9]. Reference [7] carried out experiments to evaluate the recognition performance of supervised and unsupervised machine learning techniques. For task identification, many traditional methods (e.g., hand-crafted features and the codebook approach) rely on manual feature extraction and discover features using expert knowledge. The performance of those methods often depends on the quality of the obtained feature representation. However, manual feature extraction is not always possible in practice, especially when the structure of the input data is unknown because of a lack of expert knowledge. Compared with manual feature extraction methods, deep learning techniques can discover adequate features without expert knowledge or systematic exploration of the feature space. Deep learning techniques have achieved remarkable results in many fields, such as image recognition [10], speech recognition [11], and natural language processing [12]. Existing deep learning methods for HAR can be further divided into two categories: Deep Neural Networks (DNNs) [13] and Convolutional Neural Networks (CNNs) [14]. DNNs (which contain an input layer, at least two hidden layers and an output layer) can extract highly abstract features by stacking hidden layers. Stacking adds more possible connections between input and output neurons, which increases the ways in which learned features can be re-used. Researchers who have used DNN methods for HAR include [15], who investigated deep neural networks with wearable sensor data, and [16], who explored temporal deep neural networks for active biometric authentication.

Recently, many researchers have adopted CNNs to build human activity recognition systems, such as [6, 15, 17, 18]. CNNs can model the entire sequence by sharing weights from local to global, extract abstract features at hierarchical layers through a series of convolutional operations, and process the raw activity signals to capture latent features. CNNs are inspired by the discovery of visual cortical cells and retain the spatial information of the data through receptive fields. The power of CNNs stems in large part from their ability to exploit symmetries through a combination of weight sharing and translation equivariance. In addition, because they act as feature extractors, multiple convolution operators can be stacked to create a hierarchy of progressively more abstract features. Reference [14] proposed CNN-based approaches to automatically extract discriminative features for HAR. More and more studies use variants of CNNs to learn sensor-based data representations for human activity recognition and have achieved remarkable performance. The model in [19] consists of two or three temporal-convolution layers with a ReLU activation function followed by a max-pooling layer and a soft-max classifier, and can be applied over all sensors simultaneously. Yang et al. [17] introduced four temporal-convolutional layers on a single sensor, followed by a fully-connected layer and a soft-max classifier, and showed that deeper networks can find correlations between different sensors.

However, all the deep learning methods mentioned above use a single-path neural network and do not consider the spatial and temporal features of the data separately. Our design is inspired by the biological study of retinal ganglion cells in the primate visual system (Figure 1 illustrates the basic motivation of our proposed ARN). Parvocellular cells (P-cells) provide good spatial detail and color in the visual system, but their temporal resolution is very low. In addition, there are high-frequency Magnocellular cells (M-cells), which are very sensitive to temporal changes but not sensitive to spatial detail and color. Inspired by these facts, we propose a model that handles a set of activity data with an asymmetric network, using a short time window to capture spatial features and a long time window to capture fine temporal features, corresponding to the P-cells and the M-cells, respectively. Our network is end-to-end, and its input is the original sensor data: the data collected from the wearable device can be fed directly into the network. The ARN model is applicable to both supervised and unsupervised learning approaches because it is based on ResNet [20], which has been shown to be applicable to both. In this paper, we use supervised learning because it lets label information bridge the heterogeneous gap.

We propose a novel Asymmetric Residual Network, named ARN. As a new kind of deep learning network, ARN divides activity recognition into two components: (1) a residual net using a short time window (i.e., 32 or 64); (2) a residual net using a long time window (i.e., 64 or 96). The values 32, 64 and 96 denote the length of the time window, where (Length) = (Time) x (Sampling Frequency). The last-layer representations of the two parts are concatenated, and the fused representation is then used for accurate activity recognition. The advantages of ARN over other existing methods are listed in Table 1. To the best of our knowledge, this is the first work that applies an asymmetric residual net to activity recognition.

The main contributions of the paper are as follows:

1. We propose a novel neural network based on ResNet for HAR, termed ARN, which is asymmetric and has two paths working separately on short and long sliding windows. Our wide path is designed to capture global features but few spatial details, analogous to M-cells, and our narrow path is lightweight, similar to the small receptive field of P-cells.

2. We design a network consisting of asymmetric residual nets that not only manages information flow effectively, but also automatically learns an effective activity feature representation while capturing the fine feature distribution of different activities from wearable sensor data.

3. We compare the performance of our method with other relevant methods by carrying out extensive experiments on benchmark datasets. The results show that our method outperforms the other methods.

The remainder of this paper is structured as follows. In Section II, we briefly introduce the related work. In Section III, we highlight the motivation of our method and provide some theoretical analysis for its implementation. In Section IV, we present our experimental results and the corresponding analysis. Finally, Section V concludes the paper.

II. Related Work

i. Methods for Human Activity Recognition

This section introduces the feature extraction methods for HAR selected in our comparative study. There are two main directions for HAR methods: conventional recognition methods and learning-based methods. In conventional methods, features are extracted manually with expert knowledge. In learning-based methods, adequate features can be discovered without expert knowledge or systematic exploration of the feature space. Conventional methods in our comparative study include Hand-Crafted Features (HC) [21] and the Codebook approach (CB) [22]. The learning-based methods include the Autoencoders approach (AE) [24], Multi-Layer Perceptron (MLP) [25], Convolutional Neural Network (CNN) [14], Long-Short Term Memory Networks (LSTM) [26], Hybrid Convolutional and Recurrent Networks (Hybrid) [27], and Deep Residual Learning (ResNet) [20].

i.1 Hand-Crafted Features

HC comprises simple metrics computed on the data: it uses simple statistical values (e.g., std, avg, mean, max, min, median, etc.) or frequency-domain correlation features based on the Fourier transform of the signal to analyze the time series of human activity data. Because it is simple to set up and has low computational cost, it is still used in some areas, but its accuracy cannot satisfy the requirements of modern AI games. In addition, for recognition of complex high-level behaviors, identifying the relevant features through these traditional approaches is time-consuming.
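As a minimal sketch of the statistical features listed above, the following computes a few of them for one window of one sensor channel. The function name `handcrafted_features` and the sample readings are hypothetical and are not taken from the paper; the frequency-domain features are omitted.

```python
import statistics

def handcrafted_features(window):
    """Return simple per-window statistics for one sensor channel."""
    return {
        "mean": statistics.fmean(window),
        "std": statistics.pstdev(window),   # population standard deviation
        "min": min(window),
        "max": max(window),
        "median": statistics.median(window),
    }

# Hypothetical accelerometer readings for one channel (illustrative only).
window = [0.1, 0.4, 0.3, 0.9, 0.2, 0.5]
features = handcrafted_features(window)
```

In a full pipeline, one such feature dictionary would be computed per channel and per sliding window, and the values concatenated into a single feature vector for a classifier.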

i.2 Codebook approach

CB consists of two consecutive steps. The first step is codebook construction, which builds a codebook by applying a clustering algorithm to a set of subsequences extracted from the original data sequence. Each cluster center is considered a codeword that represents a distinct subsequence. The second step is codeword assignment, which aims to build a feature vector associated with a data sequence. Subsequences are first extracted from the sequence, and each subsequence is then assigned to the most similar codeword. Finally, a histogram-based feature representing the distribution of all codewords is built from this information. During codebook construction, a set of subsequences is first extracted from the original data sequence using a sliding time window with window size $w$ and sliding stride $h$. Then, a k-means clustering [28] algorithm is applied to the subsequences to obtain $n$ clusters of similar subsequences. The similarity metric between two subsequences is the Euclidean distance. Finally, we obtain a codebook that consists of $n$ codewords.

During codeword assignment, we first extract subsequences from a sequence using the same sliding window approach as in codebook construction, and each subsequence is assigned its most similar codeword. Then, a histogram of codeword frequencies is built from this information. Finally, we obtain a probabilistic feature representation by normalizing the histogram.

The approach described above can be called codebook with hard assignment (CBH) because each subsequence is assigned to a codeword deterministically. However, this approach may lack flexibility in uncertain situations where a subsequence is similar to two or more codewords. To solve this problem, we use a soft assignment variant (CBS). CBS exploits kernel density estimation [23] to perform smooth assignment of subsequences to multiple codewords. It yields a feature that represents a smooth distribution of codewords and considers the similarity between all codewords and subsequences.
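The assignment step can be sketched as follows, assuming the codebook has already been built (the k-means step is skipped and two toy codewords are given by hand). The Gaussian kernel and the `bandwidth` parameter in the soft variant are assumptions made for illustration; the paper only states that CBS uses kernel density estimation.

```python
import math

def subsequences(seq, w, h):
    """Slide a window of size w with stride h over a 1-D sequence."""
    return [seq[s:s + w] for s in range(0, len(seq) - w + 1, h)]

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def hard_histogram(seq, codebook, w, h):
    """CBH: each subsequence votes for its single nearest codeword."""
    counts = [0.0] * len(codebook)
    for sub in subsequences(seq, w, h):
        nearest = min(range(len(codebook)),
                      key=lambda c: euclidean(sub, codebook[c]))
        counts[nearest] += 1.0
    total = sum(counts)
    return [c / total for c in counts]

def soft_histogram(seq, codebook, w, h, bandwidth=1.0):
    """CBS-style: spread each vote over all codewords (Gaussian kernel assumed)."""
    counts = [0.0] * len(codebook)
    for sub in subsequences(seq, w, h):
        weights = [math.exp(-euclidean(sub, cw) ** 2 / (2 * bandwidth ** 2))
                   for cw in codebook]
        norm = sum(weights)
        for c, wt in enumerate(weights):
            counts[c] += wt / norm
    total = sum(counts)
    return [c / total for c in counts]

# Toy codebook with n = 2 codewords of length w = 2 (illustrative only).
codebook = [[0.0, 0.0], [1.0, 1.0]]
seq = [0.0, 0.1, 0.8, 1.0, 1.1]
hard = hard_histogram(seq, codebook, w=2, h=1)
soft = soft_histogram(seq, codebook, w=2, h=1)
```

Both functions return a normalized histogram over the codewords. The subsequence `[0.1, 0.8]` lies between the two codewords, so the hard variant gives its whole vote to one of them while the soft variant splits it.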

i.3 Autoencoders approach

AE is a specific architecture that consists of an encoder and a decoder, as depicted in Fig 2. The encoder projects input data into a feature space of lower dimension, while the decoder maps the encoded features back to the input space. After training, the AE can reproduce the input data at the output according to a loss function such as Mean Squared Error.

i.4 Multi-Layer Perceptron

MLP is one of the simplest neural networks. An important feature of the MLP is that it has multiple layers. As shown in Fig 3, the first layer is called the input layer, the last layer the output layer, and the middle layers the hidden layers. MLP does not specify the number of hidden layers, so an appropriate number can be chosen according to need. Each neuron in a fully-connected layer takes the outputs of the previous layer as its inputs. Stacking layers can be seen as extracting features of an increasingly higher level of abstraction, and the output features of neurons at the $n$-th layer are calculated from neurons at the $(n-1)$-th layer.

i.5 Convolutional Neural Networks

CNNs can automatically extract features from raw sensor data without the need for specialized expert knowledge [29]. A standard convolutional neural network consists of convolutional layers, max-pooling layers, fully-connected layers (FC) and a soft-max layer. Instead of using predefined filters as in traditional feature extraction methods, CNNs learn locally connected neurons that represent data-specific filters. Because CNNs share weights among neurons, they have far fewer parameters than traditional neural networks [30].

Convolutional layers are an important component of CNNs. They apply several convolution filters (or kernels), which learn feature representations of the raw input, so that complex operations can be performed by the convolution operation. The dimension of the filters (or kernels) is determined by the input dimension. A convolution kernel is a function that generalizes a linear model for the underlying local patch. It works well for abstraction when instances of latent concepts are linearly separable. In each convolutional layer, neurons of the current layer are connected to the neurons of the previous layer through a feature mapping operation. Thus, the feature maps of the next layer are obtained by applying an element-wise nonlinear activation function to the convolved results of the previous layer. The value of feature map $j$ in the $(l+1)$-th layer, $x_{j}^{l+1}$, is calculated by:

$x_{j}^{l+1} = \sigma\left(\sum_{i \in Maps} x_{i}^{l} \otimes K_{ij}^{l} + b_{j}^{l}\right)$

where $Maps$ denotes the feature maps in the $l$-th layer and $b_{j}^{l}$ is a bias vector. $\sigma(\cdot)$ is the activation function used to improve the performance of CNNs. The most notable non-linear activation function is ReLU, which is defined as $\sigma(x)=\text{max}(x,0)$. ReLU allows networks to compute much faster than sigmoid or tanh activation functions, induces sparsity in the hidden neurons, and makes it easier for networks to obtain sparse representations. ReLU outputs zero for negative inputs, which may affect backpropagation, but many research results have shown that ReLU [31] works much better than sigmoid and tanh [32].
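The feature-map computation described above can be sketched in plain Python for the 1-D case: each input map is slid against its own kernel, the results are summed over the input maps, a bias is added and ReLU is applied. The sliding dot product without kernel flipping and the scalar bias per output map are simplifying assumptions, and the input values are made up for illustration.

```python
def relu(values):
    """Element-wise ReLU: sigma(x) = max(x, 0)."""
    return [max(v, 0.0) for v in values]

def conv1d_valid(signal, kernel):
    """'Valid' 1-D sliding dot product of a signal with a kernel."""
    k = len(kernel)
    return [sum(signal[t + d] * kernel[d] for d in range(k))
            for t in range(len(signal) - k + 1)]

def feature_map(inputs, kernels, bias):
    """One output map j: sum over input maps i, add bias, apply ReLU."""
    out_len = len(inputs[0]) - len(kernels[0]) + 1
    acc = [bias] * out_len
    for x_i, k_ij in zip(inputs, kernels):
        for t, v in enumerate(conv1d_valid(x_i, k_ij)):
            acc[t] += v
    return relu(acc)

# Two toy input maps and one kernel per input map (illustrative only).
inputs = [[1.0, 2.0, 3.0, 4.0], [0.0, -1.0, 0.0, 1.0]]
kernels = [[-1.0, 1.0], [2.0, 0.0]]
x_next = feature_map(inputs, kernels, bias=0.5)  # -> [1.5, 0.0, 1.5]
```

The middle position sums to -0.5 before the activation, so ReLU clips it to zero, which is the sparsity effect mentioned above.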

Pooling layers, which come after the convolutional layers, are another component of CNNs. In the pooling layer, a pooling operation reduces the number of neuron connections between neighboring convolutional layers, thus reducing computational complexity.

Fully-connected layers, which unfold the 2-D feature matrix into a 1-D feature vector for the classification task, contain about 90% of the parameters of the entire CNN.

The loss function plays an important role in different classification tasks. The most common loss function is soft-max. Given a training set $\{x^{(i)},y^{(i)};i\in[1,N],y^{(i)}\in[1,K]\}$, where $x^{(i)}\in\mathbb{R}^{D}$ is the input patch and $y^{(i)}$ is the target label among the $K$ classes, the prediction $a_{j}^{(i)}$ of the $j$-th class for the $i$-th input is transformed with the soft-max function:

$p(a^{(i)} = j) = \frac{e^{a_{j}^{(i)}}}{\sum_{k=1}^{K} e^{a_{k}^{(i)}}}$

Soft-max normalizes the predictions to a probability distribution over all classes. The soft-max loss is as follows:

$\ell = -\frac{1}{N}\left(\sum_{i=1}^{N}\sum_{j=1}^{K} \mathbf{1}\{y^{(i)} = j\} \cdot \log p(a^{(i)} = j)\right)$
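A small sketch of the soft-max function and the soft-max loss described above, in plain Python. Labels are 0-based here (the text uses 1 to $K$), the maximum score is subtracted before exponentiating for numerical stability, and the scores are made-up values.

```python
import math

def softmax(scores):
    """Normalize raw class scores a_j into probabilities."""
    m = max(scores)                      # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_loss(batch_scores, labels):
    """Mean negative log-probability of the true class over N examples."""
    n = len(batch_scores)
    log_probs = [math.log(softmax(a)[y]) for a, y in zip(batch_scores, labels)]
    return -sum(log_probs) / n

# Two toy examples with K = 3 classes (illustrative only).
scores = [[2.0, 1.0, 0.1], [0.0, 0.0, 0.0]]
labels = [0, 2]
probs = softmax(scores[0])
loss = softmax_loss(scores, labels)
```

The indicator in the loss selects only the log-probability of the true class, which is why the code indexes `softmax(a)[y]` instead of summing over all classes.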

Regularization is required in CNNs. Overfitting is an unavoidable problem in convolutional neural networks, but it can be effectively reduced by regularization. As a means of regularization, dropout prevents co-dependence between neurons in a network and forces the network to be more accurate even in the absence of certain information.

i.6 Recurrent Neural Networks and Long-Short Term Memory Networks

Recurrent Neural Networks (RNNs) are a specific architecture in which connections between neurons form directed cycles and the output of a neuron depends on the state of the network at previous timestamps. RNNs can find patterns with long-term dependencies because they memorize information extracted from past data. In practice, however, a phenomenon called gradient explosion or vanishing greatly affects the performance of RNNs. The vanishing or exploding gradient problem refers to the derivative of the error function with respect to the network weights becoming very large or close to zero [33]. This has an adverse impact on the weight updates performed by the back-propagation algorithm. LSTM was therefore designed to solve the vanishing or exploding gradient problem in RNNs. LSTM extends RNNs with memory cells and remembers information over time by storing it in an internal memory. The internal state can be updated and erased depending on the input and the state at the previous time step. As shown in Fig 4, this mechanism is achieved by introducing an internal processor called a cell. A cell contains three gates, called the input gate ($i_{t}$), output gate ($o_{t}$) and forget gate ($f_{t}$); $c_{t}$ is the cell state. Gates regulate the information update to the cell state. Their equations are given below:

$f_{t} = \sigma_{f}(W_{f} x_{t} + U_{f} h_{t-1} + b_{f})$

$i_{t} = \sigma_{i}(W_{i} x_{t} + U_{i} h_{t-1} + b_{i})$

$o_{t} = \sigma_{o}(W_{o} x_{t} + U_{o} h_{t-1} + b_{o})$

$\tilde{c}_{t} = \sigma_{c}(W_{c} x_{t} + U_{c} h_{t-1} + b_{c})$

$c_{t} = f_{t} \cdot c_{t-1} + i_{t} \cdot \tilde{c}_{t}$

$h_{t} = o_{t} \cdot \sigma_{h}(c_{t})$

where $\cdot$ represents the element-wise multiplication of two vectors, $x_{t}$ is the input vector to the LSTM cell at time $t$ and $h_{t}$ is the hidden state vector. $\sigma$ designates the activation function. $W$, $U$ and $b$ are the weight matrices and bias vectors.
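One step of the LSTM cell can be sketched as below. The text leaves the activation functions generic; this sketch assumes the common choice of a sigmoid for the three gates and tanh for the candidate and output activations. The parameter layout (`params[name] = (W, U, b)`) and the all-zero toy weights are hypothetical.

```python
import math

def sigmoid(values):
    return [1.0 / (1.0 + math.exp(-v)) for v in values]

def tanh(values):
    return [math.tanh(v) for v in values]

def affine(W, x, U, h, b):
    """Compute W x + U h + b with list-of-lists matrices."""
    return [sum(w_row[k] * x[k] for k in range(len(x)))
            + sum(u_row[k] * h[k] for k in range(len(h))) + b_i
            for w_row, u_row, b_i in zip(W, U, b)]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM time step; params[name] = (W, U, b) for name in f, i, o, c."""
    def pre(name):
        W, U, b = params[name]
        return affine(W, x_t, U, h_prev, b)
    f_t = sigmoid(pre("f"))            # forget gate
    i_t = sigmoid(pre("i"))            # input gate
    o_t = sigmoid(pre("o"))            # output gate
    c_tilde = tanh(pre("c"))           # candidate cell state
    c_t = [f * c + i * ct for f, c, i, ct in zip(f_t, c_prev, i_t, c_tilde)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

# Toy setup: 2 input features, 1 hidden unit, all weights zero (illustrative).
zero_gate = ([[0.0, 0.0]], [[0.0]], [0.0])
params = {name: zero_gate for name in ("f", "i", "o", "c")}
h_t, c_t = lstm_step([0.3, -0.7], [0.0], [1.0], params)
```

With zero weights every gate equals 0.5 and the candidate is 0, so the cell state is simply halved, which shows how the forget gate scales the previous memory.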

i.7 Hybrid Convolutional and Recurrent Networks

Hybrid comprises convolutional, LSTM and softmax layers as depicted in Fig 5. Convolutional layers have the ability to extract the features from input data and create a hierarchy of progressively more abstract features by stacking several convolutional operators. LSTM includes a memory to model temporal dependencies in time series problems. Therefore, the combination of CNN and LSTM can capture time dependencies on features extracted by convolutional operations.

i.8 Deep Residual Learning

ResNet is designed to address the degradation problem, in which accuracy on the training set saturates or even decreases as network depth increases. Unlike an ordinary convolutional neural network, ResNet has many stacked residual blocks, in which identity mappings are added to connect input and output. A residual block with identity mapping can be expressed in a general form:

$y_{l} = h(x_{l}) + F(x_{l}, W_{l})$

$x_{l+1} = f(y_{l})$

where $x_{l+1}$ and $x_{l}$ are the output and input of the $l$-th unit, $F$ is a residual function and $W_{l}$ are the parameters of the unit. $h\left(x_{l}\right)=x_{l}$ is an identity mapping and $f$ is a ReLU activation function. The key idea of ResNet is to learn the additive residual function $F$ with respect to $h\left(x_{l}\right)$. ResNet performs element-wise addition of input and output by attaching a shortcut connection. This simple addition can speed up training and improve the training result without adding parameters to the network. As network depth increases, this simple structure is a good solution to the degradation problem.
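The residual unit can be illustrated with a toy example in which the residual function $F$ is two small linear layers with a ReLU in between. This stand-in for $F$ is an assumption made for brevity; the blocks in the paper are convolutional bottleneck stacks (Table 2).

```python
def relu(values):
    return [max(v, 0.0) for v in values]

def linear(W, x):
    """Matrix-vector product with a list-of-lists matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def residual_unit(x_l, W1, W2):
    """y_l = x_l + F(x_l, W_l); x_{l+1} = ReLU(y_l)."""
    residual = linear(W2, relu(linear(W1, x_l)))    # F(x_l, W_l), toy version
    y_l = [a + b for a, b in zip(x_l, residual)]    # shortcut h(x_l) = x_l
    return relu(y_l)                                # f is ReLU

x = [1.0, -2.0]
zero = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

out_zero = residual_unit(x, zero, zero)              # F = 0: shortcut only
out_identity = residual_unit(x, identity, identity)  # F adds ReLU(x) to x
```

When the weights are zero the unit reduces to `ReLU(x)`, so the input still passes through the shortcut even if the residual branch has learned nothing.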

III. Asymmetric Residual Network

Our proposed model has a narrow path (see Sec ii) and a wide path (see Sec iii), whose outputs are concatenated and sent to the fully-connected layer. The loss function is introduced in Sec v.

i. Network Architecture

As shown in Fig 6, convolutional layers and residual layers in our architecture are used to model the recognition task. In convolutional layers, the general activity features are extracted from raw sensor data. In residual layers, the special features can be extracted from general features and the special features are used for human activity recognition.

The convolutional part (i.e., conv. layers) of our architecture consists of one layer with 64 sliding windows (filters) of size $s=5\times 1$, followed by a batch normalization layer, a ReLU layer and a pooling layer [29]. The residual layers contain four “blocks”. The details of the residual blocks are shown in Table 2, and the values of $n_{1},n_{2},n_{3},n_{4}$ are set to $3,4,6,3$, respectively.

ii. Narrow Path

The narrow path can be any convolutional model that works on sequence data as a spatiotemporal volume. For example, Reference [34] introduced Two-Stream Inflated 3D Convolutional Networks, in which the filters and pooling kernels of very deep image classification networks are expanded into 3D; Reference [35] introduced spatiotemporal ResNets as a combination of two-stream convolutional networks and ResNets; and Reference [36] introduced non-local operations as a generic family of building blocks for capturing long-range dependencies. The key concept in our narrow path is a short sliding time window that scans the sequential activity data. We know from [37] that feature learning methods achieve good performance when $T=32,64,96$. Therefore, a typical value of $T$ we study is 32 [37]. Denoting the number of sensor channels as $S$, the raw clip size is $T\times S$. This path feeds compact information into the network in order to capture spatial features.

iii. Wide Path

In parallel to the narrow path, the wide path is another convolutional model with a long sliding time window. The two paths operate on the same raw activity data sequences; the wide path uses a sliding time window of length $\alpha T$, i.e., $\alpha$ times longer than that of the narrow path. A typical value is $\alpha=3$ [38] in our experiments. The presence of $\alpha$ is the key to the narrow-wide concept: it explicitly indicates that the two paths work on different time windows. Our wide path feeds a long sequence of activity data into the network in order to pursue global features throughout the network hierarchy. Our wide path is distinguished from existing methods in that it can use significantly lower channel capacity to achieve good accuracy for the ARN model. The low channel capacity can also be interpreted as a weaker ability to represent spatial semantics. Our wide path not only has a long sliding time window, but also pursues high-dimensional features throughout the network hierarchy, maintaining temporal fidelity as much as possible.
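To make the two window sizes concrete, the sketch below cuts the same toy recording into narrow clips of $T$ samples and wide clips of $\alpha T$ samples. The stride, the recording length and the channel count `S = 4` are assumptions chosen for illustration; the text does not specify how the two windows are strided or aligned.

```python
def sliding_windows(sequence, length, stride):
    """Cut a multi-channel sequence into windows of `length` samples."""
    return [sequence[s:s + length]
            for s in range(0, len(sequence) - length + 1, stride)]

T, alpha, S = 32, 3, 4   # T and alpha follow the text; S = 4 is a toy value

# Hypothetical recording: 200 time steps, each with S sensor readings.
recording = [[float(t)] * S for t in range(200)]

narrow_clips = sliding_windows(recording, T, stride=T)          # {T, S} each
wide_clips = sliding_windows(recording, alpha * T, stride=T)    # {alpha*T, S} each
```

Since (Length) = (Time) x (Sampling Frequency), at OPPORTUNITY's 30 Hz sampling rate a window of $T=32$ samples covers roughly 1.07 s and a window of $\alpha T=96$ samples covers 3.2 s.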

iv. Lateral Concatenation

Our lateral concatenation fuses the narrow path and the wide path. We denote the representation shape of the narrow path as $\{T,S\}$ and that of the wide path as $\{\alpha T,S\}$. The two representations are fused by concatenation. Therefore, the shape of the concatenation layer is $\{(1+\alpha)T,S\}$.
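The shape arithmetic of the concatenation can be checked with a short sketch that joins a $\{T,S\}$ representation and an $\{\alpha T,S\}$ representation along the time axis. The helper name `concat_time`, the channel count and the placeholder values are hypothetical; in the actual network the fused tensors are learned features rather than constants.

```python
def concat_time(narrow, wide):
    """Concatenate two {time, S} representations along the time axis."""
    if narrow and wide and len(narrow[0]) != len(wide[0]):
        raise ValueError("both paths must have the same number of channels S")
    return narrow + wide

T, alpha, S = 32, 3, 4   # S = 4 is a toy value

narrow_repr = [[0.0] * S for _ in range(T)]          # shape {T, S}
wide_repr = [[1.0] * S for _ in range(alpha * T)]    # shape {alpha*T, S}
fused = concat_time(narrow_repr, wide_repr)          # shape {(1+alpha)*T, S}
```

With $T=32$ and $\alpha=3$ the fused representation has 128 time steps and the channel dimension is unchanged.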

v. Loss Function

In order to train classification models, classification objectives (such as logistic loss and softmax loss) have been widely explored. For accurate human activity recognition, using labels that are different from the ground-truth for prediction, cannot contribute to the update of the network parameters. For depth estimates, predictions that are close to the ground-truth labels also help to update network parameters. In this work, we employ softmax loss for training the human activity recognition model. For each training sequence $x$ , the probability of each label $k\in\{1,2,...,\mathcal{K}\}$ in our model is computed via softmax:

$$p(k|x)=\frac{\exp(z_{k})}{\sum_{i=1}^{\mathcal{K}}\exp(z_{i})},$$

where $z_{i}$ are the logits or unnormalized log-probabilities. Here, the $z_{i}$ are computed by adding a fully-connected layer on top of the sequence data embedding $\phi(x)$, i.e., $z_{i}=W_{i}^{T}\phi(x)+b_{i}$, where $W_{i}$ and $b_{i}$ are the weights and bias for label $i$, respectively. Let $q(k|x)$ denote the ground-truth distribution over classes for this training example such that $\sum_{k=1}^{\mathcal{K}}q(k|x)=1$. The cross-entropy loss for the example is computed as:

$$\ell=-\sum_{k=1}^{\mathcal{K}}q(k|x)\,\log p(k|x).$$
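The softmax probability and the cross-entropy loss described in this subsection can be sketched in a few lines of plain Python. The logits and the one-hot ground-truth distribution below are made-up values, and subtracting the maximum logit is a common numerical-stability step that is assumed here rather than stated in the paper.

```python
import math


def softmax(logits):
    """Turn logits z_1..z_K into probabilities p(k|x)."""
    shift = max(logits)                      # stabilizes exp() for large logits
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(q, p):
    """Cross-entropy between ground-truth q(k|x) and prediction p(k|x)."""
    # Terms with q_k == 0 contribute nothing and are skipped.
    return -sum(q_k * math.log(p_k) for q_k, p_k in zip(q, p) if q_k > 0)


logits = [2.0, 1.0, 0.1]     # hypothetical logits for K = 3 classes
q = [1.0, 0.0, 0.0]          # one-hot ground truth: class 1 is correct
p = softmax(logits)
loss = cross_entropy(q, p)
```

For a one-hot $q$, the loss reduces to the negative log-probability assigned to the correct class.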

IV. Experiments

To demonstrate the performance of the proposed ARN method, we carried out extensive experiments on two widely used benchmark datasets, OPPORTUNITY and UniMiB-SHAR.

i. Dataset

Human activity features are usually unique and cyclical, and natural human activities include walking, running, jumping, and so on. Therefore, dataset construction should consider a set of activity data that covers a variety of natural human activities.

We use benchmark datasets to validate the model performance, and use different action sequences to verify whether they belong to the same person. Many benchmark activity datasets exist, such as OPPORTUNITY [39], WISDM [40], UniMiB-SHAR [41], MHEALTH [42], and PAMAP2 [43]. In this paper, we evaluate our method on the following two datasets.

The OPPORTUNITY dataset has been widely used in prior research. It contains four subjects performing 17 different (morning) Activities of Daily Living (ADLs) in a sensor-rich environment, as listed in Table 3 4. The data were acquired at a sampling frequency of 30 Hz using 7 wireless body-worn inertial measurement units (IMUs). Each IMU consists of a 3D accelerometer, a 3D gyroscope, and a 3D magnetic sensor, and there are 12 additional 3D accelerometers placed on the back, arms, ankles, and hips, accounting for a total of 145 different sensor channels. During the data collection process, each subject performed five ADL sessions and one drill session. During the ADL sessions, named “ADL1” to “ADL5”, subjects were asked to perform the activities naturally. During the drill sessions, subjects performed 20 repetitions of each of the 17 ADLs of the dataset. The dataset contains about 6 hours of information in total, and the data are labeled on a timestamp level. The dataset was used in an open activity recognition challenge in which participants competed to achieve the highest recognition performance. In our experiments, the training and testing sets have 63 dimensions (36-D on the hand, 9-D on the back, and 18-D on the ankle).

The UniMiB-SHAR dataset was collected from 30 healthy subjects (6 male and 24 female) using the 3D accelerometer of a Samsung Galaxy Nexus I9250 running Android OS version 5.1.1. It contains 11771 samples of both human activities and falls performed by the 30 subjects, whose ages range from 18 to 60. The data are sampled at a constant rate of 50 Hz and split into 17 different activity classes: 9 safety activities and 8 dangerous activities (e.g., a falling action), as shown in Table 3 5. Unlike the OPPORTUNITY dataset, this dataset does not have a NULL class and remains relatively balanced. It allows researchers to work toward more robust features and classification schemes. In our experiments, the training and testing sets have 3 dimensions.

Both the OPPORTUNITY and UniMiB-SHAR datasets were collected in real environments. The two datasets have their own characteristics and contain different sensors. The UniMiB-SHAR dataset contains only accelerometer data, which has a low power cost. The OPPORTUNITY dataset combines accelerometer, gyroscope, and magnetic sensor data, and can provide accurate limb orientation.

ii. Baseline

We compared the proposed ARN method against several classic and state-of-the-art activity recognition methods. We roughly divided these methods into two categories. The conventional recognition methods include HC [21], CBH [22], and CBS [23]. The learning-based methods include AE [24], MLP [25], CNN [14], LSTM [26], Hybrid [27], and ResNet [20]. For the conventional methods, we use hand-crafted features; readers can find more details in [37]. For the learning-based methods, we use raw activity data as input. Following [37], the hyper-parameters of these learning-based baseline models, except ResNet, for the OPPORTUNITY and UniMiB-SHAR datasets are provided in Table 6. (The hyper-parameters used by ResNet are those used in one of the paths of the proposed ARN model.)

iii. Implementation and Setting

Our ARN model is implemented in TensorFlow [44], a system that transfers complex data structures to artificial intelligence neural networks for analysis and processing. The computing platform is equipped with 2 $\times$ Intel E5-2600 CPUs, 128 GB of RAM, and an NVIDIA TITAN Xp GPU with 12 GB of memory. The model is trained using the ADADELTA gradient descent algorithm with default parameters (i.e., an initial learning rate of 1) for 50 epochs. The batch size is set to 128. The hyper-parameters of the proposed model are provided in Table 2.
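For readers unfamiliar with ADADELTA, the following is a minimal sketch of its update rule on a one-dimensional toy problem. It is not the TensorFlow optimizer used in the experiments: the decay rate `rho=0.95`, the constant `eps=1e-6`, the step count, and the quadratic objective are all assumptions chosen for illustration. Only the learning rate of 1 comes from the text.

```python
import math


def adadelta_minimize(grad_fn, x0, steps=200, lr=1.0, rho=0.95, eps=1e-6):
    """Run the ADADELTA update rule on a scalar parameter and return it."""
    x = x0
    avg_sq_grad = 0.0   # running average of squared gradients
    avg_sq_step = 0.0   # running average of squared parameter updates
    for _ in range(steps):
        g = grad_fn(x)
        avg_sq_grad = rho * avg_sq_grad + (1.0 - rho) * g * g
        # Step size is the ratio of the two running RMS values.
        step = -math.sqrt(avg_sq_step + eps) / math.sqrt(avg_sq_grad + eps) * g
        avg_sq_step = rho * avg_sq_step + (1.0 - rho) * step * step
        x += lr * step
    return x


# Toy objective f(x) = x^2 with gradient 2x (assumed for illustration).
x_start = 3.0
x_end = adadelta_minimize(lambda x: 2.0 * x, x_start)
```

Because the step is scaled by running statistics rather than by a hand-tuned rate, a learning rate of 1 simply means the adaptive step is applied unchanged.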

Sliding Time Window Size: The length of the sliding window $T$ is an important hyper-parameter of the proposed model. As with the baseline methods, we carried out comparative studies using $T=32$ (approximately 1 s), $T=64$ (approximately 2 s), and $T=96$ (approximately 3 s). For the proposed model, we use $T=32$ or $T=64$ for the narrow path and $T=64$ or $T=96$ for the wide path, respectively.
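The segmentation of a raw sensor stream into windows of length $T$ can be sketched as follows. The stride of 16 samples and the stream length of 256 are assumptions for illustration, since the overlap between windows is not specified in this section.

```python
def sliding_windows(sequence, window, stride):
    """Cut a sequence into fixed-length windows, dropping the ragged tail."""
    last_start = len(sequence) - window
    return [sequence[start:start + window]
            for start in range(0, last_start + 1, stride)]


# Hypothetical stream of 256 samples from a single sensor channel.
samples = list(range(256))

narrow_windows = sliding_windows(samples, window=32, stride=16)  # T = 32
wide_windows = sliding_windows(samples, window=96, stride=16)    # alpha*T = 96
```

With the same stride, the longer window yields fewer segments from the same stream, and each segment covers three times as many samples.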

iv. Performance Measure

ADL datasets are often highly unbalanced. The OPPORTUNITY dataset is extremely imbalanced, as the NULL class represents more than 75% of the recorded data. For this dataset, the overall classification accuracy is not an appropriate measure of performance, because the recognition rate of the majority classes might skew the performance statistics to the detriment of the least represented classes. As a result, many previous studies such as [29] use an evaluation metric that is independent of the class distribution: the $F_{1}$-score. The $F_{1}$-score combines two measures, the precision $p$ and the recall $r$: $p$ is the number of correct positive results divided by the number of all positive results returned by the classifier, and $r$ is the number of correct positive results divided by the number of all samples that are actually positive. The $F_{1}$-score is the harmonic mean of $p$ and $r$, with a best value of 1 and a worst value of 0. In this paper, to make the comparison with previous work easier, we use the weighted $F_{1}$-score (the sum of class $F_{1}$-scores, weighted by the class proportion):

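$$F_{1}=\sum_{g}2\,w_{g}\,\frac{p_{g}\cdot r_{g}}{p_{g}+r_{g}},$$

where $p_{g}$ and $r_{g}$ are the precision and recall of class $g$, $w_{g}=N_{g}/N_{total}$, $N_{g}$ is the number of samples in class $g$, and $N_{total}$ is the total number of samples.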
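A small worked example may make the weighted $F_{1}$-score concrete. The sketch below computes it from two label lists in plain Python; the function name `weighted_f1` and the six toy labels are illustrative assumptions, and a class with no correct predictions is given an $F_{1}$ of 0 by convention.

```python
from collections import Counter


def weighted_f1(y_true, y_pred):
    """Sum of per-class F1 scores, weighted by class proportion N_g / N_total."""
    n_total = len(y_true)
    support = Counter(y_true)                 # N_g for every class g
    score = 0.0
    for g, n_g in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == g and p == g)
        predicted = sum(1 for p in y_pred if p == g)
        precision = tp / predicted if predicted else 0.0
        recall = tp / n_g
        if precision + recall > 0:
            f1 = 2 * precision * recall / (precision + recall)
            score += (n_g / n_total) * f1
    return score


# Toy labels (assumed): NULL class dominates, as in OPPORTUNITY.
y_true = ["null", "null", "null", "walk", "walk", "sit"]
y_pred = ["null", "null", "walk", "walk", "walk", "null"]
score = weighted_f1(y_true, y_pred)   # 0.5 * 2/3 + 1/3 * 0.8 + 1/6 * 0 = 0.6
```

Here the NULL class has $p=r=2/3$, the walk class has $p=2/3$ and $r=1$, and the sit class is never predicted correctly, which gives a weighted score of 0.6.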

v. Results and discussions

In this section, we present and discuss the results. To get insight into how these methods are applied to the domain, we show the performance of these methods and evaluate some key parameters.

The weighted $F_{1}$-scores of all models on OPPORTUNITY and UniMiB-SHAR are listed in Table 7. The results on these datasets show that the proposed ARN method substantially outperforms all other methods against which it was compared. Compared to conventional recognition methods such as CBS, ARN achieves an absolute boost over the best conventional method of 4.98% and 14.65% on the OPPORTUNITY and UniMiB-SHAR datasets, respectively. In addition, most of the learning-based recognition methods outperform the conventional recognition methods. In particular, on the OPPORTUNITY dataset, the Hybrid method achieves the best performance among all the learning-based methods; compared to the Hybrid method, ARN achieves a boost of 2.4%. On the UniMiB-SHAR dataset, the MLP method achieves the best performance among all the learning-based methods; compared to the MLP method, ARN achieves a boost of 2.48%. We also compared ARN with a single-path residual network, i.e., ResNet: ARN achieves an absolute boost of 1.26% and 1.32% on the OPPORTUNITY and UniMiB-SHAR datasets, respectively.

From Table 7, we can observe that the gap between the learning-based methods and the conventional methods is larger on the UniMiB-SHAR dataset than on the OPPORTUNITY dataset. A likely reason is that the OPPORTUNITY dataset has more sensor channels than the UniMiB-SHAR dataset. Comparing the results carefully, we found that the proposed method shows a larger performance improvement on the UniMiB-SHAR dataset than on the OPPORTUNITY dataset. This indicates that our method is effective.

We also observe from Table 7 that the length of the sliding time window has an impact on the performance of activity recognition. A short time window contains too little information. As the time window grows, it contains more information, and the accuracy improves accordingly. However, an even longer sliding time window does not yield better recognition performance [DBLP:journals/sensors/LiSNKG47]. Most methods perform best in human activity recognition tasks when $T=64$. The reason is that longer frames potentially contain data related to a higher number of activities, making their majority labeling less accurate.
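The effect of window length on majority labeling can be illustrated with a toy example. The label stream below is hypothetical, and the "purity" measure (the fraction of samples in a window that carry the majority label) is introduced here only for illustration.

```python
from collections import Counter


def majority_label(labels):
    """Return the most frequent label in a window and the fraction it covers."""
    label, count = Counter(labels).most_common(1)[0]
    return label, count / len(labels)


# Hypothetical per-sample labels: 40 samples of walking, 24 of sitting,
# then 32 of standing.
stream = ["walk"] * 40 + ["sit"] * 24 + ["stand"] * 32

short_label, short_purity = majority_label(stream[:32])   # T = 32
long_label, long_purity = majority_label(stream[:96])     # T = 96
```

The 32-sample window lies entirely inside one activity, while the 96-sample window spans three activities and is labeled "walk" although fewer than half of its samples belong to that class.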

vi. Hyperparameters Evaluation

Impact of the length of the narrow and wide path selection: The proposed model has two paths: a narrow path (i.e., the sliding time window is short) and a wide path (i.e., the sliding time window is long). To verify the impact of different combinations of sliding time window lengths on the results, we ran comparison experiments with three combinations (i.e., 32-64, 32-96, 64-96). ARN_1, ARN_2 and ARN_3 denote the sliding time window combinations 32-64, 32-96 and 64-96, respectively. We carried out experiments on the two datasets. The weighted $F_{1}$-score results are shown in Table 8. From Table 8, we can observe that on the OPPORTUNITY dataset, ARN_2 outperforms ARN_1 and ARN_3; the minimum image size is $63(D)\times 32(T)$ and the accuracy is 90.29%. On the UniMiB-SHAR dataset, ARN_1 outperforms ARN_2 and ARN_3; the minimum image size is $3(D)\times 32(T)$ and the accuracy is 76.39%. The performance gap between the three experiments is very small. This indicates that ARN is a stable model that is not sensitive to the length of the sliding time window.

V. Conclusions

In this paper, we propose a novel asymmetric residual network, named ARN, for activity recognition using wearable device data. To improve the accuracy of activity recognition, our method consists of two paths. The first path uses a short time window to capture spatial features, and the second path uses a long time window to capture fine temporal features. Unlike other learning-based methods, ARN considers the spatial and temporal features of the data at the same time. It can effectively manage information flow and automatically learn activity feature representations. ARN is an end-to-end network, and the data collected by the wearable device can be fed directly into the network. Comprehensive experiments on the two benchmark human activity recognition datasets demonstrate that ARN outperforms the compared baselines. The method therefore has good application prospects. However, technologies such as biometric identification, fingerprint recognition, and iris recognition have achieved more than 98% accuracy and are widely used in everyday life. Compared with these practical recognition applications, the accuracy of ARN does not yet meet the requirements of real-world use. Therefore, there is still considerable room for progress.

In future work, we may investigate how to extract more fine-grained features using an attention strategy, and focus on dynamic data fusion algorithms that maximize the retention of data features and achieve higher recognition accuracy. In addition, we plan to recognize dangerous human activities, although recognizing these activities involves extracting many fine-grained features. Therefore, we may take advantage of the memory mechanism to design a memory-augmented neural network [45] that can learn to find supporting pre-stored clues (i.e., representations of historical activities or others).

Bibliography

[1] Sukor, A.S.A.; Zakaria, A.; Rahim, N.A.; Kamarudin, L.M.; Setchi, R.; Nishizaki, H. A hybrid approach of knowledge-driven and data-driven reasoning for activity recognition in smart homes. Journal of Intelligent and Fuzzy Systems 2019, 36, 4177-4188. doi:10.3233/JIFS-169976.
[2] Xiao, Z.; Lim, H.B.; Ponnambalam, L. Participatory Sensing for Smart Cities: A Case Study on Transport Trip Quality Measurement. IEEE Trans. Industrial Informatics 2017, 13, 759-770. doi:10.1109/TII.2017.2678522.
[3] Fortino, G.; Ghasemzadeh, H.; Gravina, R.; Liu, P.X.; Poon, C.C.Y.; Wang, Z. Advances in multi-sensor fusion for body sensor networks: Algorithms, architectures, and applications. Information Fusion 2019, 45, 150-152. doi:10.1016/j.inffus.2018.01.012.
[4] Qiu, J.X.; Yoon, H.; Fearn, P.A.; Tourassi, G.D. Deep Learning for Automated Extraction of Primary Sites From Cancer Pathology Reports. IEEE J. Biomedical and Health Informatics 2018, 22, 244-251. doi:10.1109/JBHI.2017.2700722.
[5] Oh, I.; Cho, H.; Kim, K. Playing real-time strategy games by imitating human players’ micromanagement skills based on spatial analysis. Expert Syst. Appl. 2017, 71, 192-205. doi:10.1016/j.eswa.2016.11.026.
[6] Lisowska, A.; O’Neil, A.; Poole, I. Cross-cohort Evaluation of Machine Learning Approaches to Fall Detection from Accelerometer Data. Proceedings of the 11th International Joint Conference on Biomedical Engineering Systems and Technologies (BIOSTEC 2018) - Volume 5: HEALTHINF, Funchal, Madeira, Portugal, January 19-21, 2018, pp. 77-82. doi:10.5220/0006554400770082.
[7] Attal, F.; Mohammed, S.; Dedabrishvili, M.; Chamroukhi, F.; Oukhellou, L.; Amirat, Y. Physical Human Activity Recognition Using Wearable Sensors. Sensors 2015, 15, 31314-31338. doi:10.3390/s151229858.
[8] Bao, L.; Intille, S.S. Activity Recognition from User-Annotated Acceleration Data. Pervasive Computing, Second International Conference, PERVASIVE 2004, Vienna, Austria, April 21-23, 2004, Proceedings, 2004, pp. 1-17. doi:10.1007/978-3-540-24646-6_1.