Separable Convolutional LSTMs for Faster Video Segmentation

Andreas Pfeuffer; Klaus Dietmayer

arXiv:1907.06876·cs.CV·July 17, 2019

TL;DR

This paper introduces separable convolutional LSTMs to improve the speed of video segmentation, reducing parameters and computation while maintaining accuracy, and proposes a new flickering pixel metric.

Contribution

It generalizes separable convolution techniques for convLSTMs, significantly reducing parameters and FLOPs in video segmentation networks.

Findings

1. Achieves up to 15% faster GPU inference than standard convLSTM cells, at similar or slightly worse accuracy.

2. Reduces parameters and FLOPs by applying spatially and depthwise separable convolutions inside the convLSTM cell.

3. Introduces new metrics (mFP and mFIP) for flickering pixels in video segmentation.

Abstract

Semantic Segmentation is an important module for autonomous robots such as self-driving cars. The advantage of video segmentation approaches compared to single image segmentation is that temporal image information is considered, and their performance increases due to this. Hence, single image segmentation approaches are extended by recurrent units such as convolutional LSTM (convLSTM) cells, which are placed at suitable positions in the basic network architecture. However, a major critique of video segmentation approaches based on recurrent neural networks is their large parameter count and their computational complexity, and so, their inference time of one video frame takes up to 66 percent longer than their basic version. Inspired by the success of the spatial and depthwise separable convolutional neural networks, we generalize these techniques for convLSTMs in this work, so that the…

Tables

Table 1. TABLE I: FLOPs of a standard convLSTM layer

operation	count	FLOPs
Convolutions	$8$	$\mathbf{8} \cdot 2 \cdot K_{x} \cdot K_{y} \cdot I \cdot O \cdot D_{x} \cdot D_{y}$
Hadamard Product	$3$	$\mathbf{3} \cdot O \cdot D_{x} \cdot D_{y}$
Sigma Operation	$3$	$\mathbf{3} \cdot 5 \cdot O \cdot D_{x} \cdot D_{y}$
TanH Operation	$2$	$\mathbf{2} \cdot 5 \cdot O \cdot D_{x} \cdot D_{y}$
Additions	$9$	$\mathbf{9} \cdot O \cdot D_{x} \cdot D_{y}$
total		$(16 \cdot K_{x} \cdot K_{y} \cdot I + 37) \cdot O \cdot D_{x} \cdot D_{y}$

Table 2. TABLE II: Evaluation on Cityscapes

approach	acc. (%)	mIoU (%)	mFIP (‰)
ICNet [24]	$92.50$	$60.07$	$64.42$
LSTM-ICNet v2 [14]	$92.96$	$61.93$	$62.67$
spatial-LSTM-ICNet v2	$92.80$	$61.70$	$63.26$
depth-LSTM-ICNet v2	$92.56$	$61.73$	$61.17$
sep-LSTM-ICNet v2	$92.47$	$60.18$	$59.77$
LSTM-ICNet v5 [14]	$92.74$	$61.53$	$62.67$
spatial-LSTM-ICNet v5	$92.59$	$61.56$	$62.80$
depth-LSTM-ICNet v5	$92.52$	$60.67$	$61.33$
sep-LSTM-ICNet v5	$92.54$	$60.98$	$59.74$
LSTM-ICNet v6 [14]	$92.69$	$61.39$	$64.13$
spatial-LSTM-ICNet v6	$92.79$	$60.90$	$63.05$
depth-LSTM-ICNet v6	$92.30$	$60.93$	$61.27$
sep-LSTM-ICNet v6	$92.42$	$60.78$	$59.72$

Table 3. TABLE III: Evaluation on Virtual Kitti

approach	acc. (%)	mIoU (%)	mFP (‰)
ICNet [24]	$92.60$	$58.44$	$54.42$
LSTM-ICNet v2 [14]	$93.01$	$59.71$	$51.55$
spatial-LSTM-ICNet v2	$92.89$	$59.55$	$51.33$
depth-LSTM-ICNet v2	$92.86$	$59.77$	$51.70$
sep-LSTM-ICNet v2	$92.68$	$58.72$	$52.16$
LSTM-ICNet v5 [14]	$92.96$	$60.19$	$51.66$
spatial-LSTM-ICNet v5	$93.02$	$60.37$	$51.27$
depth-LSTM-ICNet v5	$92.56$	$59.09$	$52.56$
sep-LSTM-ICNet v5	$92.57$	$58.67$	$52.22$
LSTM-ICNet v6 [14]	$93.07$	$60.50$	$50.90$
spatial-LSTM-ICNet v6	$92.99$	$60.54$	$51.10$
depth-LSTM-ICNet v6	$92.53$	$59.03$	$53.52$
sep-LSTM-ICNet v6	$92.64$	$58.85$	$52.99$

Table 4. TABLE IV: Comparison of Memory, FLOPs, and inference time

approach	model parameters	model parameters (%)	GFLOPs	GFLOPs (%)	inference time (CPU)	inference time (CPU, %)	inference time (GPU)	inference time (GPU, %)
ICNet [24]	$6707k$	$100.00\%$	$58.38$	$100.00\%$	$1388ms$	$100.00\%$	$48.19ms$	$100.00\%$
LSTM-ICNet v2 [14]	$7887k$	$100.00\%$	$135.74$	$100.00\%$	$2503ms$	$100.00\%$	$65.01ms$	$100.00\%$
spatial-LSTM-ICNet v2	$7494k$	$95.02\%$	$109.97$	$81.01\%$	$2265ms$	$90.47\%$	$68.15ms$	$104.83\%$
depth-LSTM-ICNet v2	$\mathbf{6717k}$	$85.17\%$	$59.03$	$43.49\%$	$\mathbf{1616ms}$	$64.56\%$	$\mathbf{60.62ms}$	$93.24\%$
sep-LSTM-ICNet v2	$6848k$	$86.83\%$	$67.62$	$49.82\%$	$1734ms$	$69.26\%$	$62.18ms$	$95.65\%$
LSTM-ICNet v5 [14]	$10247k$	$100.00\%$	$173.98$	$100.00\%$	$2967ms$	$100.00\%$	$69.64ms$	$100.00\%$
spatial-LSTM-ICNet v5	$9068k$	$88.49\%$	$135.4$	$77.87\%$	$2565ms$	$86.45\%$	$74.82ms$	$107.43\%$
depth-LSTM-ICNet v5	$\mathbf{6736k}$	$65.74\%$	$59.36$	$34.12\%$	$\mathbf{1664ms}$	$56.07\%$	$\mathbf{62.65ms}$	$89.96\%$
sep-LSTM-ICNet v5	$7129k$	$69.57\%$	$72.20$	$41.50\%$	$1828ms$	$61.61\%$	$65.08ms$	$93.46\%$
LSTM-ICNet v6 [14]	$11428k$	$100.00\%$	$251.34$	$100.00\%$	$4125ms$	$100.00\%$	$80.16ms$	$100.00\%$
spatial-LSTM-ICNet v6	$9855k$	$86.24\%$	$187.07$	$74.43\%$	$3403ms$	$82.49\%$	$88.08ms$	$109.87\%$
depth-LSTM-ICNet v6	$\mathbf{6746k}$	$59.03\%$	$60.02$	$23.88\%$	$\mathbf{1879ms}$	$45.55\%$	$\mathbf{68.60ms}$	$85.57\%$
sep-LSTM-ICNet v6	$7270k$	$63.62\%$	$81.44$	$32.40\%$	$2127ms$	$51.55\%$	$72.68ms$	$90.66\%$

Equations

\begin{aligned}
\mathbf{I}_{t} &= \sigma(\mathbf{W}_{xi} \ast \mathbf{X}_{t} + \mathbf{W}_{hi} \ast \mathbf{H}_{t-1} + \mathbf{b}_{i}) \\
\mathbf{F}_{t} &= \sigma(\mathbf{W}_{xf} \ast \mathbf{X}_{t} + \mathbf{W}_{hf} \ast \mathbf{H}_{t-1} + \mathbf{b}_{f}) \\
\mathbf{J}_{t} &= \tanh(\mathbf{W}_{xc} \ast \mathbf{X}_{t} + \mathbf{W}_{hc} \ast \mathbf{H}_{t-1} + \mathbf{b}_{c}) \\
\mathbf{O}_{t} &= \sigma(\mathbf{W}_{xo} \ast \mathbf{X}_{t} + \mathbf{W}_{ho} \ast \mathbf{H}_{t-1} + \mathbf{b}_{o}) \\
\mathbf{C}_{t} &= \mathbf{F}_{t} \circ \mathbf{C}_{t-1} + \mathbf{I}_{t} \circ \mathbf{J}_{t} \\
\mathbf{H}_{t} &= \mathbf{O}_{t} \circ \tanh(\mathbf{C}_{t})
\end{aligned}

(16 \cdot K_{x} \cdot K_{y} \cdot I + 37) \cdot O \cdot D_{x} \cdot D_{y}

\begin{aligned}
\mathbf{I}_{t} &= \sigma(\mathbf{W}_{xi}^{w} \ast (\mathbf{W}_{xi}^{h} \ast \mathbf{X}_{t}) + \mathbf{W}_{hi}^{w} \ast (\mathbf{W}_{hi}^{h} \ast \mathbf{H}_{t-1}) + \mathbf{b}_{i}) \\
\mathbf{F}_{t} &= \sigma(\mathbf{W}_{xf}^{w} \ast (\mathbf{W}_{xf}^{h} \ast \mathbf{X}_{t}) + \mathbf{W}_{hf}^{w} \ast (\mathbf{W}_{hf}^{h} \ast \mathbf{H}_{t-1}) + \mathbf{b}_{f}) \\
\mathbf{J}_{t} &= \tanh(\mathbf{W}_{xc}^{w} \ast (\mathbf{W}_{xc}^{h} \ast \mathbf{X}_{t}) + \mathbf{W}_{hc}^{w} \ast (\mathbf{W}_{hc}^{h} \ast \mathbf{H}_{t-1}) + \mathbf{b}_{c}) \\
\mathbf{O}_{t} &= \sigma(\mathbf{W}_{xo}^{w} \ast (\mathbf{W}_{xo}^{h} \ast \mathbf{X}_{t}) + \mathbf{W}_{ho}^{w} \ast (\mathbf{W}_{ho}^{h} \ast \mathbf{H}_{t-1}) + \mathbf{b}_{o}) \\
\mathbf{C}_{t} &= \mathbf{F}_{t} \circ \mathbf{C}_{t-1} + \mathbf{I}_{t} \circ \mathbf{J}_{t} \\
\mathbf{H}_{t} &= \mathbf{O}_{t} \circ \tanh(\mathbf{C}_{t})
\end{aligned}

(32 \cdot K_{x} \cdot I + 37) \cdot O \cdot D_{x} \cdot D_{y},

\frac{(32 \cdot K_{x} \cdot I + 37) \cdot O \cdot D_{x} \cdot D_{y}}{(16 \cdot K_{x} \cdot K_{x} \cdot I + 37) \cdot O \cdot D_{x} \cdot D_{y}} \approx \frac{2}{K_{x}} .

\begin{aligned}
\mathbf{I}_{t} &= \sigma(\mathbf{W}_{xi} \circledast \mathbf{X}_{t} + \mathbf{W}_{hi} \circledast \mathbf{H}_{t-1} + \mathbf{b}_{i}) \\
\mathbf{F}_{t} &= \sigma(\mathbf{W}_{xf} \circledast \mathbf{X}_{t} + \mathbf{W}_{hf} \circledast \mathbf{H}_{t-1} + \mathbf{b}_{f}) \\
\mathbf{J}_{t} &= \tanh(\mathbf{W}_{xc} \circledast \mathbf{X}_{t} + \mathbf{W}_{hc} \circledast \mathbf{H}_{t-1} + \mathbf{b}_{c}) \\
\mathbf{O}_{t} &= \sigma(\mathbf{W}_{xo} \circledast \mathbf{X}_{t} + \mathbf{W}_{ho} \circledast \mathbf{H}_{t-1} + \mathbf{b}_{o}) \\
\mathbf{C}_{t} &= \mathbf{F}_{t} \circ \mathbf{C}_{t-1} + \mathbf{I}_{t} \circ \mathbf{J}_{t} \\
\mathbf{H}_{t} &= \mathbf{O}_{t} \circ \tanh(\mathbf{C}_{t})
\end{aligned}

(16 \cdot K_{x} \cdot K_{y} + 37) \cdot O \cdot D_{x} \cdot D_{y},

\frac{(16 \cdot K_{x} \cdot K_{y} + 37) \cdot O \cdot D_{x} \cdot D_{y}}{(16 \cdot K_{x} \cdot K_{y} \cdot I + 37) \cdot O \cdot D_{x} \cdot D_{y}} \approx \frac{1}{I} .

\begin{aligned}
\mathbf{I}_{t} &= \sigma(\mathbf{W}_{xi}^{1 \times 1} \ast (\mathbf{W}_{xi} \circledast \mathbf{X}_{t}) + \mathbf{W}_{hi}^{1 \times 1} \ast (\mathbf{W}_{hi} \circledast \mathbf{H}_{t-1}) + \mathbf{b}_{i}) \\
\mathbf{F}_{t} &= \sigma(\mathbf{W}_{xf}^{1 \times 1} \ast (\mathbf{W}_{xf} \circledast \mathbf{X}_{t}) + \mathbf{W}_{hf}^{1 \times 1} \ast (\mathbf{W}_{hf} \circledast \mathbf{H}_{t-1}) + \mathbf{b}_{f}) \\
\mathbf{J}_{t} &= \tanh(\mathbf{W}_{xc}^{1 \times 1} \ast (\mathbf{W}_{xc} \circledast \mathbf{X}_{t}) + \mathbf{W}_{hc}^{1 \times 1} \ast (\mathbf{W}_{hc} \circledast \mathbf{H}_{t-1}) + \mathbf{b}_{c}) \\
\mathbf{O}_{t} &= \sigma(\mathbf{W}_{xo}^{1 \times 1} \ast (\mathbf{W}_{xo} \circledast \mathbf{X}_{t}) + \mathbf{W}_{ho}^{1 \times 1} \ast (\mathbf{W}_{ho} \circledast \mathbf{H}_{t-1}) + \mathbf{b}_{o}) \\
\mathbf{C}_{t} &= \mathbf{F}_{t} \circ \mathbf{C}_{t-1} + \mathbf{I}_{t} \circ \mathbf{J}_{t} \\
\mathbf{H}_{t} &= \mathbf{O}_{t} \circ \tanh(\mathbf{C}_{t}),
\end{aligned}

(16 \cdot K_{x} \cdot K_{y} + 16 \cdot I + 37) \cdot O \cdot D_{x} \cdot D_{y},

\frac{(16 \cdot K_{x} \cdot K_{y} + 16 \cdot I + 37) \cdot O \cdot D_{x} \cdot D_{y}}{(16 \cdot K_{x} \cdot K_{y} \cdot I + 37) \cdot O \cdot D_{x} \cdot D_{y}} \approx \frac{1}{I} + \frac{1}{K_{x} \cdot K_{y}}

mFP = \frac{1}{h \cdot w} \cdot \sum_{t=1}^{T} \left\lVert \mathbf{D}[t] \barwedge \mathbf{D}[t-1] \right\rVert_{1}

\mathbf{D}[t] = (\mathbf{S}[t] \barwedge \mathbf{G}[t]) \circ \mathbf{S}[t]

mFIP = \frac{1}{h \cdot w} \cdot \sum_{t=1}^{T} \left\lVert \mathbf{S}[t] \barwedge \mathbf{S}[t-1] \right\rVert_{1}

Code & Models

Repositories

Andreas-Pfeuffer/LSTM-ICNet
tf

Taxonomy

Methods: Convolution · ConvLSTM · Sigmoid Activation · Tanh Activation · Long Short-Term Memory

Full text

Separable Convolutional LSTMs for Faster Video Segmentation

Andreas Pfeuffer and Klaus Dietmayer

The authors are with the Institute of Measurement, Control, and Microtechnology, Ulm University, 89081 Ulm, Germany.

Abstract

Semantic Segmentation is an important module for autonomous robots such as self-driving cars. The advantage of video segmentation approaches compared to single image segmentation is that temporal image information is considered, and their performance increases due to this. Hence, single image segmentation approaches are extended by recurrent units such as convolutional LSTM (convLSTM) cells, which are placed at suitable positions in the basic network architecture. However, a major critique of video segmentation approaches based on recurrent neural networks is their large parameter count and their computational complexity, and so, their inference time of one video frame takes up to 66 percent longer than their basic version. Inspired by the success of the spatial and depthwise separable convolutional neural networks, we generalize these techniques for convLSTMs in this work, so that the number of parameters and the required FLOPs are reduced significantly. Experiments on different datasets show that the segmentation approaches using the proposed, modified convLSTM cells achieve similar or slightly worse accuracy, but are up to 15 percent faster on a GPU than the ones using the standard convLSTM cells. Furthermore, a new evaluation metric is introduced, which measures the amount of flickering pixels in the segmented video sequence.

This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.

I Introduction

Autonomous robots such as self-driving cars and mobile house-robots require a good understanding of their environment. For instance, the drivable area around the robot is often determined by semantic segmentation of the images delivered by the installed cameras. The main goal of semantic segmentation is to predict a class for each image pixel, e.g. to determine whether the pixel belongs to the road, to a vehicle, or to the background. Usually, the cameras deliver a stream of images. These video sequences are often segmented by processing each video frame independently with single image segmentation approaches such as ICNet [24], PSPNet [25], and Deeplab [2]. As a result, object edges usually flicker between two frames of the video sequence, and ghost objects or partly incorrectly segmented objects often occur in only a single frame, while they are classified correctly in the next frame. These errors can be avoided by using image information from previous frames. For instance, the information of the last frames can be taken into account by recurrent neural networks (RNNs) to improve the segmentation accuracy of the current frame. Popular RNNs are Long Short-Term Memory networks (LSTMs, [8]) and their extension, convolutional LSTMs (convLSTMs, [17]), since they can be easily trained and integrated into neural networks. Different video segmentation approaches using convLSTMs [14, 18, 23] show that the performance increases when temporal image information is used, and that the amount of flickering (ghost) objects and object edges can be reduced. However, there is no appropriate evaluation metric in the literature that can measure these flickering image pixels. Evaluation metrics such as pixelwise accuracy (acc.) or mean Intersection over Union (mIoU) are not suitable for detecting flickering image pixels, since the number of such errors is very small compared to the number of correctly classified pixels, so that flickering has only been analyzed qualitatively until now. Therefore, the evaluation metric mean Flickering Pixels (mFP) is introduced in this work, which measures the flickering image pixels of a video sequence by means of the difference between neighboring frames.

Another problem of using convLSTM cells for video segmentation is that they are computationally expensive and increase the number of model parameters of the neural network enormously.

For example, Fig. 1 compares the number of parameters, the required FLOPs, and the corresponding inference time of the different LSTM-ICNet versions proposed in [14] with the original ICNet [24]. Although LSTM-ICNet version 2 is extended by only one convLSTM cell, the number of FLOPs increases by about $133\%$, the number of model parameters by about $18\%$, and the inference time by about $35\%$, from $48ms$ to $65ms$. The other LSTM-ICNet versions, which contain more convLSTM cells, take even longer and have considerably more parameters. Inspired by popular acceleration techniques for standard convolution layers, this work introduces and discusses different ways to speed up convLSTM cells while reducing their large parameter count, making convLSTMs more suitable for real-time video segmentation.
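The relative increases quoted for LSTM-ICNet version 2 can be reproduced from the ICNet and LSTM-ICNet v2 rows of Table IV. The following Python sketch is only a worked check of that arithmetic; the helper name `relative_increase` and the dictionary keys are illustrative and do not come from the paper.

```python
def relative_increase(baseline, extended):
    """Percentage by which `extended` exceeds `baseline`."""
    return 100.0 * (extended - baseline) / baseline


# Values for ICNet and LSTM-ICNet v2, copied from Table IV.
icnet = {"params_k": 6707, "gflops": 58.38, "gpu_ms": 48.19}
lstm_icnet_v2 = {"params_k": 7887, "gflops": 135.74, "gpu_ms": 65.01}

overhead = {key: relative_increase(icnet[key], lstm_icnet_v2[key]) for key in icnet}
for key, value in overhead.items():
    print(f"{key}: +{value:.1f} %")
```

Rounded to whole percent, the three values agree with the figures of about $18\%$ (parameters), $133\%$ (FLOPs), and $35\%$ (GPU inference time) given in the text.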

II Related Work

Nowadays, recurrent neural networks (RNNs) are successfully applied in several applications, such as speech recognition [6, 12], machine translation [19], and handwriting recognition [7]. This success is based on the introduction of Long Short-Term Memory networks (LSTMs [8]) in 1997, which overcome the vanishing/exploding gradient problem of classical RNNs. The LSTM cell contains a memory cell to store state information. During training, the cell learns when to read the memory and when to erase or write to it. However, LSTMs are computationally costly and memory-intensive, and there are different approaches in the literature to overcome this problem. For instance, in [10], two different possibilities are introduced to make LSTM cells more computationally efficient and to reduce the number of parameters at the same time. Motivated by AlexNet [20], the authors partition the LSTM cell into small independent feature groups, and the outputs of the groups are concatenated at the end to a common feature map. Furthermore, they propose a factorized LSTM cell, in which the weight matrix $\mathbf{W}$ of size $2n\times 4n$ is approximated by the product of two smaller matrices, $\mathbf{W}\approx\mathbf{W}_{1}\cdot\mathbf{W}_{2}$, where $\mathbf{W}_{1}$ is of size $2n\times r$, $\mathbf{W}_{2}$ is of size $r\times 4n$, $n$ denotes the cell size, and $r$ is chosen so that $\mathbf{W}$ is well approximated. In [10], $r$ was set to $512$ for a cell size of $n=8192$. In 2015, Shi et al. [17] proposed convolutional LSTMs (convLSTMs), which are a generalization of LSTMs for image processing tasks. Their advantage is that they are translation invariant, analogously to convolution layers, and that the number of required model parameters can be reduced significantly. Nevertheless, convLSTMs are still expensive in time and memory, and hence this work discusses different ways to reduce the computational costs and the number of parameters further.
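To give a feeling for the saving of the factorized LSTM of [10], the following Python sketch counts the weight-matrix entries before and after factorization for the quoted values $n=8192$ and $r=512$. It is a back-of-the-envelope illustration only: the function names are hypothetical, and biases are ignored.

```python
def full_lstm_weights(n):
    """Entries of the dense 2n x 4n weight matrix W."""
    return (2 * n) * (4 * n)


def factorized_lstm_weights(n, r):
    """Entries of W1 (2n x r) plus W2 (r x 4n)."""
    return (2 * n) * r + r * (4 * n)


n, r = 8192, 512
full = full_lstm_weights(n)
factorized = factorized_lstm_weights(n, r)
ratio = factorized / full
print(f"full: {full:,}  factorized: {factorized:,}  ratio: {ratio:.2%}")
```

In general the ratio is $6nr / 8n^{2} = 3r / (4n)$, so the saving grows as $r$ becomes small relative to the cell size $n$.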

A common way to accelerate standard convolution layers in neural networks is to use spatially separable convolutions. In [21], for example, an $n\times n$ convolution layer is approximated by an $n\times 1$ convolution layer followed by a $1\times n$ convolution layer, which reduces the number of parameters and FLOPs by $33\%$ for $n=3$. Furthermore, convolution layers can also be separated depthwise, as in [9]. In this case, each input channel is convolved independently with one filter, and the number of FLOPs and parameters is reduced enormously compared to the standard convolution. In [3, 9], a $1\times 1$ (pointwise) convolution is applied after the depthwise convolution to combine the outputs of the depthwise layers. This combination of depthwise convolution and $1\times 1$ convolution is called (depthwise) separable convolution in the literature. In the following sections, these acceleration techniques for standard convolutions are generalized and applied to convLSTMs, so that the number of parameters and the required FLOPs are reduced significantly.
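The parameter savings of these three convolution variants can be compared with a few lines of Python. This is a minimal sketch under stated assumptions: biases are ignored, the intermediate channel count of the spatially separable variant is assumed to equal the output channel count, and the channel count of 128 is chosen only for illustration. All function names are hypothetical.

```python
def standard_params(k, c_in, c_out):
    """Weights of a k x k convolution (biases ignored)."""
    return k * k * c_in * c_out


def spatially_separable_params(k, c_in, c_out, c_mid=None):
    """k x 1 convolution followed by a 1 x k convolution."""
    c_mid = c_out if c_mid is None else c_mid  # assumed intermediate width
    return k * c_in * c_mid + k * c_mid * c_out


def depthwise_params(k, c_in):
    """One k x k filter per input channel."""
    return k * k * c_in


def depthwise_separable_params(k, c_in, c_out):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    return depthwise_params(k, c_in) + c_in * c_out


k, channels = 3, 128
base = standard_params(k, channels, channels)
savings = {
    "spatially separable": 1 - spatially_separable_params(k, channels, channels) / base,
    "depthwise": 1 - depthwise_params(k, channels) / base,
    "depthwise separable": 1 - depthwise_separable_params(k, channels, channels) / base,
}
for name, saving in savings.items():
    print(f"{name}: {saving:.1%} fewer weights than a standard {k}x{k} convolution")
```

For $n=3$ the spatially separable variant needs $2n$ instead of $n^{2}$ weights per channel pair, which is the $33\%$ reduction mentioned above.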

In the literature, there are only a few approaches using recurrent neural networks (RNNs) for video segmentation, since video sequences are usually split into their individual frames, which are processed independently of each other by state-of-the-art segmentation approaches such as ICNet [24], PSPNet [25], or Deeplab [2]. These works do not consider temporal image information, although it can improve the segmentation accuracy further, as Valipour et al. show in [22]. The authors showed on several datasets that the performance of the Fully Convolutional Network (FCN) [16] can be increased if a recurrent unit is placed between the encoder and the decoder. A similar approach was proposed in [23], where a modified VGG19 architecture [18] was used instead of the FCN. Furthermore, different recurrent structures such as convRNN, convGRU, and convLSTM were compared in that work, and the convLSTM cells turned out to achieve the highest accuracy. In [15], skip connections containing a convLSTM cell were added to the encoder-LSTM-decoder architecture. In recent years, state-of-the-art approaches have rarely been based on the classical encoder-decoder architecture, but use multi-branch architectures instead. Hence, the LSTM-ICNet was introduced in [14], in which the ICNet [24] is extended by LSTM cells at suitable positions. According to [14], the different LSTM-ICNet versions outperform the pure CNN-based approach by up to $1.5$ percent, but their inference time increases by up to $32ms$, which corresponds to an increase of about $66$ percent. In this work, the LSTM-ICNet versions are sped up by means of the proposed separable convLSTMs, and their inference time decreases significantly.

III Separable Convolutional LSTMs

Standard convLSTMs [17] are very popular for capturing temporal information in data sequences. They use convolutional layers in their input-to-state and state-to-state transitions instead of fully connected layers, as conventional LSTMs [8] do. The output $\mathbf{H}_{t}$ and the cell-states of the convLSTM layer are calculated by:

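$$
\begin{aligned}
\mathbf{I}_{t} &= \sigma(\mathbf{W}_{xi} \ast \mathbf{X}_{t} + \mathbf{W}_{hi} \ast \mathbf{H}_{t-1} + \mathbf{b}_{i}) \\
\mathbf{F}_{t} &= \sigma(\mathbf{W}_{xf} \ast \mathbf{X}_{t} + \mathbf{W}_{hf} \ast \mathbf{H}_{t-1} + \mathbf{b}_{f}) \\
\mathbf{J}_{t} &= \tanh(\mathbf{W}_{xc} \ast \mathbf{X}_{t} + \mathbf{W}_{hc} \ast \mathbf{H}_{t-1} + \mathbf{b}_{c}) \\
\mathbf{O}_{t} &= \sigma(\mathbf{W}_{xo} \ast \mathbf{X}_{t} + \mathbf{W}_{ho} \ast \mathbf{H}_{t-1} + \mathbf{b}_{o}) \\
\mathbf{C}_{t} &= \mathbf{F}_{t} \circ \mathbf{C}_{t-1} + \mathbf{I}_{t} \circ \mathbf{J}_{t} \\
\mathbf{H}_{t} &= \mathbf{O}_{t} \circ \tanh(\mathbf{C}_{t})
\end{aligned}
\tag{1}
$$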

where $\ast$ denotes the convolution operator and $\circ$ the Hadamard product. The $\mathbf{W}$-terms are weight matrices, and the $\mathbf{b}$-terms are biases. $\mathbf{X}$ denotes the input of the convLSTM cell and $\mathbf{H}$ the corresponding output, while $\mathbf{O}$ and $\mathbf{C}$ are the output state and the cell state, respectively. Note that the convLSTM cell of (1) does not contain peephole connections, unlike [17]. However, extending (1) and the remaining equations with peephole connections is straightforward, and thus they are omitted for the sake of brevity. All in all,

$$
(16 \cdot K_{x} \cdot K_{y} \cdot I + 37) \cdot O \cdot D_{x} \cdot D_{y}
\tag{2}
$$

FLOPs are necessary for a kernel size of $K_{x}\times K_{y}$ and a feature map of size $D_{x}\times D_{y}$, assuming that the activation functions sigmoid ($\sigma$) and tanh take 5 FLOPs each. The number of input channels is $I$, and the number of output channels is $O$. A detailed list of the individual components of one convLSTM cell is given in Table I.
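The breakdown of Table I can be written down as a short Python sketch that sums the individual components and compares the result with the closed-form total. The function name, the argument names, and the feature-map size of $64\times 128$ are illustrative assumptions, not values from the paper.

```python
def conv_lstm_flops(k_x, k_y, c_in, c_out, d_x, d_y, activation_flops=5):
    """FLOPs of one standard convLSTM layer, following the breakdown of Table I."""
    per_layer_outputs = c_out * d_x * d_y
    breakdown = {
        "convolutions": 8 * 2 * k_x * k_y * c_in * per_layer_outputs,
        "hadamard": 3 * per_layer_outputs,
        "sigmoid": 3 * activation_flops * per_layer_outputs,
        "tanh": 2 * activation_flops * per_layer_outputs,
        "additions": 9 * per_layer_outputs,
    }
    # The sum is evaluated before the "total" key is added.
    breakdown["total"] = sum(breakdown.values())
    return breakdown


# Illustrative configuration: 3 x 3 kernel, I = O = 128, assumed 64 x 128 feature map.
flops = conv_lstm_flops(k_x=3, k_y=3, c_in=128, c_out=128, d_x=64, d_y=128)
closed_form = (16 * 3 * 3 * 128 + 37) * 128 * 64 * 128
print(flops["total"] == closed_form)
print(f"share of convolutions: {flops['convolutions'] / flops['total']:.2%}")
```

The constant 37 in the closed form is the sum of the non-convolution terms, $3 + 3\cdot 5 + 2\cdot 5 + 9$, so for realistic channel counts the convolutions dominate the cost.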

The disadvantages of convLSTMs are their vast computational costs and their large memory consumption. For instance, LSTM-ICNet version 2, which contains only one convLSTM layer before the softmax operation, takes about $17ms$ longer on a GPU than the ICNet ($65.01ms$ instead of $48.19ms$, see Table IV), which corresponds to an increase in inference time of about $35\%$. Therefore, three different possibilities for reducing the computational costs of LSTM-ICNet are described in the following.
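To make the cost structure of the standard cell concrete, one time step of the convLSTM update in (1) can be sketched in NumPy as follows. This is a didactic sketch, not the authors' TensorFlow implementation: the naive `conv2d_same` helper, the dictionary layout of the weights, and all names are assumptions made for illustration, and peephole connections are omitted as in (1).

```python
import numpy as np


def conv2d_same(x, w):
    """Zero-padded 'same' cross-correlation.

    x: (C_in, H, W), w: (C_out, C_in, K_y, K_x) with odd kernel sizes.
    """
    c_out, _, k_y, k_x = w.shape
    _, height, width = x.shape
    padded = np.pad(x, ((0, 0), (k_y // 2, k_y // 2), (k_x // 2, k_x // 2)))
    out = np.zeros((c_out, height, width))
    for dy in range(k_y):
        for dx in range(k_x):
            window = padded[:, dy:dy + height, dx:dx + width]
            out += np.einsum("oc,chw->ohw", w[:, :, dy, dx], window)
    return out


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def conv_lstm_step(x_t, h_prev, c_prev, weights, biases):
    """One time step of the convLSTM cell without peephole connections."""
    def pre_activation(gate):
        return (conv2d_same(x_t, weights["x" + gate])
                + conv2d_same(h_prev, weights["h" + gate])
                + biases[gate][:, None, None])

    i_t = sigmoid(pre_activation("i"))  # input gate
    f_t = sigmoid(pre_activation("f"))  # forget gate
    j_t = np.tanh(pre_activation("c"))  # candidate cell input
    o_t = sigmoid(pre_activation("o"))  # output gate
    c_t = f_t * c_prev + i_t * j_t
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t


# Usage: run four random frames through a small cell, starting from zero states.
rng = np.random.default_rng(0)
c_in, c_out, size = 2, 3, 5
weights = {f"x{g}": 0.1 * rng.standard_normal((c_out, c_in, 3, 3)) for g in "ifco"}
weights.update({f"h{g}": 0.1 * rng.standard_normal((c_out, c_out, 3, 3)) for g in "ifco"})
biases = {g: np.zeros(c_out) for g in "ifco"}
h = np.zeros((c_out, size, size))
c = np.zeros((c_out, size, size))
for _ in range(4):
    frame = rng.standard_normal((c_in, size, size))
    h, c = conv_lstm_step(frame, h, c, weights, biases)
print(h.shape, c.shape)
```

The eight calls to `conv2d_same` per step correspond to the eight convolutions counted in Table I; the separable variants discussed next replace exactly these calls and leave the gate arithmetic untouched.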

III-A Spatially Separable Convolutional LSTMs

One possibility to reduce the number of FLOPs and the number of parameters is to replace an $n\times n$ convLSTM layer by an $n\times 1$ convLSTM layer followed by a $1\times n$ convLSTM layer, analogously to the Inception V3 modules [21]. However, convLSTM layers consist not only of convolutions but also of other costly operations such as activation functions and elementwise multiplications (see (1) and Table I for more details), and these operations would have to be applied twice in this case. Hence, a more efficient way is to perform the spatial separation inside instead of outside of the convLSTM cell, so that the remaining operations are executed only once. In other words, each convolution $\mathbf{W}\ast\mathbf{Y}$ inside the convLSTM cell is approximated by $\mathbf{W}^{w}\ast(\mathbf{W}^{h}\ast\mathbf{Y})$, where $\mathbf{W}$ is a $K_{x}\times K_{y}$ filter kernel, and $\mathbf{W}^{h}$ and $\mathbf{W}^{w}$ are $K_{x}\times 1$ and $1\times K_{y}$ filter kernels, respectively. The corresponding key equations of the spatially separable convLSTM (spatial-convLSTM) are

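$$
\begin{aligned}
\mathbf{I}_{t} &= \sigma(\mathbf{W}_{xi}^{w} \ast (\mathbf{W}_{xi}^{h} \ast \mathbf{X}_{t}) + \mathbf{W}_{hi}^{w} \ast (\mathbf{W}_{hi}^{h} \ast \mathbf{H}_{t-1}) + \mathbf{b}_{i}) \\
\mathbf{F}_{t} &= \sigma(\mathbf{W}_{xf}^{w} \ast (\mathbf{W}_{xf}^{h} \ast \mathbf{X}_{t}) + \mathbf{W}_{hf}^{w} \ast (\mathbf{W}_{hf}^{h} \ast \mathbf{H}_{t-1}) + \mathbf{b}_{f}) \\
\mathbf{J}_{t} &= \tanh(\mathbf{W}_{xc}^{w} \ast (\mathbf{W}_{xc}^{h} \ast \mathbf{X}_{t}) + \mathbf{W}_{hc}^{w} \ast (\mathbf{W}_{hc}^{h} \ast \mathbf{H}_{t-1}) + \mathbf{b}_{c}) \\
\mathbf{O}_{t} &= \sigma(\mathbf{W}_{xo}^{w} \ast (\mathbf{W}_{xo}^{h} \ast \mathbf{X}_{t}) + \mathbf{W}_{ho}^{w} \ast (\mathbf{W}_{ho}^{h} \ast \mathbf{H}_{t-1}) + \mathbf{b}_{o}) \\
\mathbf{C}_{t} &= \mathbf{F}_{t} \circ \mathbf{C}_{t-1} + \mathbf{I}_{t} \circ \mathbf{J}_{t} \\
\mathbf{H}_{t} &= \mathbf{O}_{t} \circ \tanh(\mathbf{C}_{t})
\end{aligned}
\tag{3}
$$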

Spatial convLSTMs have the computational cost of:

$$
(32 \cdot K_{x} \cdot I + 37) \cdot O \cdot D_{x} \cdot D_{y},
\tag{4}
$$

if $K_{x}=K_{y}$. Using spatial-convLSTMs instead of standard convLSTMs reduces the computational cost of one cell to the fraction

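$$
\frac{(32 \cdot K_{x} \cdot I + 37) \cdot O \cdot D_{x} \cdot D_{y}}{(16 \cdot K_{x} \cdot K_{x} \cdot I + 37) \cdot O \cdot D_{x} \cdot D_{y}} \approx \frac{2}{K_{x}} .
\tag{5}
$$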

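In the case of $K_{x}=K_{y}=3$ and $I=O=128$, the theoretical cost of a spatial-convLSTM cell is thus $66.73\%$ of that of a standard convLSTM cell, i.e. about one third of the FLOPs is saved.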
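Why the two-step filtering $\mathbf{W}^{w}\ast(\mathbf{W}^{h}\ast\mathbf{Y})$ is an approximation rather than an identity can be illustrated in the single-channel case: applying a $K_{x}\times 1$ kernel followed by a $1\times K_{y}$ kernel is exactly equivalent to one pass with their outer product, which is a rank-one kernel, whereas a general $K_{x}\times K_{y}$ kernel has higher rank. The NumPy sketch below checks this equivalence numerically; the helper `correlate2d_valid` and all variable names are hypothetical.

```python
import numpy as np


def correlate2d_valid(image, kernel):
    """Single-channel 'valid' cross-correlation of a 2D image with a 2D kernel."""
    k_h, k_w = kernel.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    out = np.zeros((out_h, out_w))
    for dy in range(k_h):
        for dx in range(k_w):
            out += kernel[dy, dx] * image[dy:dy + out_h, dx:dx + out_w]
    return out


rng = np.random.default_rng(0)
image = rng.standard_normal((8, 8))
column_kernel = rng.standard_normal((3, 1))  # plays the role of W^h
row_kernel = rng.standard_normal((1, 3))     # plays the role of W^w

# Two cheap passes (3 + 3 weights) ...
two_pass = correlate2d_valid(correlate2d_valid(image, column_kernel), row_kernel)
# ... equal one pass with the rank-one 3 x 3 kernel (9 weights).
one_pass = correlate2d_valid(image, column_kernel @ row_kernel)

print(np.allclose(two_pass, one_pass))
```

In the spatial-convLSTM the two small kernels are learned directly, so no explicit factorization of a trained $K_{x}\times K_{y}$ kernel is needed.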

III-B Depthwise Convolutional LSTMs

The usage of depthwise convolutions instead of standard convolutions is an efficient way to reduce the computational costs, as the MobileNets [9] versions show. Hence, this concept is adapted to speed up the convLSTM layers. More concretely, all eight convolutions inside a convLSTM layer are replaced by depthwise convolutions, so that the depthwise convLSTM (depth-convLSTM) can be written as:

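$$
\begin{aligned}
\mathbf{I}_{t} &= \sigma(\mathbf{W}_{xi} \circledast \mathbf{X}_{t} + \mathbf{W}_{hi} \circledast \mathbf{H}_{t-1} + \mathbf{b}_{i}) \\
\mathbf{F}_{t} &= \sigma(\mathbf{W}_{xf} \circledast \mathbf{X}_{t} + \mathbf{W}_{hf} \circledast \mathbf{H}_{t-1} + \mathbf{b}_{f}) \\
\mathbf{J}_{t} &= \tanh(\mathbf{W}_{xc} \circledast \mathbf{X}_{t} + \mathbf{W}_{hc} \circledast \mathbf{H}_{t-1} + \mathbf{b}_{c}) \\
\mathbf{O}_{t} &= \sigma(\mathbf{W}_{xo} \circledast \mathbf{X}_{t} + \mathbf{W}_{ho} \circledast \mathbf{H}_{t-1} + \mathbf{b}_{o}) \\
\mathbf{C}_{t} &= \mathbf{F}_{t} \circ \mathbf{C}_{t-1} + \mathbf{I}_{t} \circ \mathbf{J}_{t} \\
\mathbf{H}_{t} &= \mathbf{O}_{t} \circ \tanh(\mathbf{C}_{t})
\end{aligned}
\tag{6}
$$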

where $\circledast$ denotes the depthwise convolution operator. A great advantage of depth-convLSTMs is that the number of parameters and the required FLOPs decrease enormously compared to standard convLSTMs and also to spatial-convLSTMs. For instance, the amount of FLOPs is only

$$
(16 \cdot K_{x} \cdot K_{y} + 37) \cdot O \cdot D_{x} \cdot D_{y},
\tag{7}
$$

which corresponds to a theoretical cost ratio, relative to the standard convLSTM, of

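$$
\frac{(16 \cdot K_{x} \cdot K_{y} + 37) \cdot O \cdot D_{x} \cdot D_{y}}{(16 \cdot K_{x} \cdot K_{y} \cdot I + 37) \cdot O \cdot D_{x} \cdot D_{y}} \approx \frac{1}{I} .
\tag{8}
$$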

In the case of $K_{x}=K_{y}=3$ and $I=O=128$, the computational costs amount to only $0.98\%$ of those of the standard convLSTM.

III-C Depthwise Separable Convolutional LSTMs

A disadvantage of the depth-convLSTM is that cross-channel information within the convLSTM cell is not used. Similarly to standard depthwise convolutions, this problem can be solved by applying $1\times 1$ convolutions after each depthwise convolution inside the convLSTM layer, so that the cross-channel information is taken into account again. More concretely, each convolution within the convLSTM cell is replaced by a (depthwise) separable convolution, and hence, for the depthwise separable convLSTM (sep-convLSTM) it holds:

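$$
\begin{aligned}
\mathbf{I}_{t} &= \sigma(\mathbf{W}_{xi}^{1 \times 1} \ast (\mathbf{W}_{xi} \circledast \mathbf{X}_{t}) + \mathbf{W}_{hi}^{1 \times 1} \ast (\mathbf{W}_{hi} \circledast \mathbf{H}_{t-1}) + \mathbf{b}_{i}) \\
\mathbf{F}_{t} &= \sigma(\mathbf{W}_{xf}^{1 \times 1} \ast (\mathbf{W}_{xf} \circledast \mathbf{X}_{t}) + \mathbf{W}_{hf}^{1 \times 1} \ast (\mathbf{W}_{hf} \circledast \mathbf{H}_{t-1}) + \mathbf{b}_{f}) \\
\mathbf{J}_{t} &= \tanh(\mathbf{W}_{xc}^{1 \times 1} \ast (\mathbf{W}_{xc} \circledast \mathbf{X}_{t}) + \mathbf{W}_{hc}^{1 \times 1} \ast (\mathbf{W}_{hc} \circledast \mathbf{H}_{t-1}) + \mathbf{b}_{c}) \\
\mathbf{O}_{t} &= \sigma(\mathbf{W}_{xo}^{1 \times 1} \ast (\mathbf{W}_{xo} \circledast \mathbf{X}_{t}) + \mathbf{W}_{ho}^{1 \times 1} \ast (\mathbf{W}_{ho} \circledast \mathbf{H}_{t-1}) + \mathbf{b}_{o}) \\
\mathbf{C}_{t} &= \mathbf{F}_{t} \circ \mathbf{C}_{t-1} + \mathbf{I}_{t} \circ \mathbf{J}_{t} \\
\mathbf{H}_{t} &= \mathbf{O}_{t} \circ \tanh(\mathbf{C}_{t}),
\end{aligned}
\tag{9}
$$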

where the $\mathbf{W}^{1\times 1}$-terms are the corresponding weight matrices of the $1\times 1$ convolutions. Compared to depth-convLSTMs, sep-convLSTMs are computationally more costly and take about

$$
(16 \cdot K_{x} \cdot K_{y} + 16 \cdot I + 37) \cdot O \cdot D_{x} \cdot D_{y},
\tag{10}
$$

FLOPs. Nevertheless, they are still much more efficient than standard convLSTMs and spatial-convLSTMs, and only

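$$
\frac{(16 \cdot K_{x} \cdot K_{y} + 16 \cdot I + 37) \cdot O \cdot D_{x} \cdot D_{y}}{(16 \cdot K_{x} \cdot K_{y} \cdot I + 37) \cdot O \cdot D_{x} \cdot D_{y}} \approx \frac{1}{I} + \frac{1}{K_{x} \cdot K_{y}}
\tag{11}
$$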

of the FLOPs of one standard convLSTM cell are necessary. For example, a sep-convLSTM cell takes only $12.1\%$ of the computational costs of the standard convLSTM in the case of $K_{x}=K_{y}=3$ and $I=O=128$.
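The three theoretical cost ratios of Sections III-A to III-C can be checked side by side with the following Python sketch. It evaluates the FLOP formulas per output element, since the common factor $O \cdot D_{x} \cdot D_{y}$ cancels in every ratio. The function name and the variant labels are illustrative.

```python
def cell_flops_per_output(k_x, k_y, c_in, variant):
    """FLOPs of one cell divided by O * D_x * D_y, for each convLSTM variant."""
    overhead = 37  # Hadamard products, activations, and additions (Table I)
    if variant == "standard":
        return 16 * k_x * k_y * c_in + overhead
    if variant == "spatial":  # formula assumes K_x == K_y
        return 32 * k_x * c_in + overhead
    if variant == "depthwise":
        return 16 * k_x * k_y + overhead
    if variant == "separable":
        return 16 * k_x * k_y + 16 * c_in + overhead
    raise ValueError(f"unknown variant: {variant}")


k, channels = 3, 128
standard = cell_flops_per_output(k, k, channels, "standard")
relative_cost = {
    variant: cell_flops_per_output(k, k, channels, variant) / standard
    for variant in ("spatial", "depthwise", "separable")
}
for variant, ratio in relative_cost.items():
    print(f"{variant}: {ratio:.2%} of the standard convLSTM FLOPs")
```

For $K_{x}=K_{y}=3$ and $I=128$ this gives roughly $66.73\%$, $0.98\%$, and $12.07\%$, the last of which is the value rounded to $12.1\%$ in the text. These are theoretical per-cell ratios; the measured savings of the complete networks in Table IV are smaller because the rest of the ICNet is unchanged.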

IV Evaluation Metric Flickering Pixels

In video segmentation tasks, several errors occur only in a single frame of the video sequence, while the affected pixels are classified correctly in the following frames. Typical examples are flickering edges and flickering (ghost) objects or object parts. However, these errors can hardly be captured by conventional evaluation metrics such as pixelwise accuracy and mIoU, since these consider only the results of one time step, and the proportion of flickering pixels is very small compared to the remaining pixels. Hence, an evaluation metric is necessary that measures the flickering pixel errors within a video sequence. For this purpose, the evaluation metric mean Flickering Pixels (mFP) is introduced, which measures the mismatch between the segmentation results of two neighboring frames. To compensate for the motion of moving objects such as walking pedestrians and for the ego-motion of the self-driving car between the individual frames, the difference between the result and the corresponding ground-truth is determined first, before the neighboring frames of the video sequence are compared. Furthermore, the mFP is normalized by the number of image pixels, so that the metric is independent of the image size. More concretely, the mFP is defined as follows:

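$$
mFP = \frac{1}{h \cdot w} \cdot \sum_{t=1}^{T} \left\lVert \mathbf{D}[t] \barwedge \mathbf{D}[t-1] \right\rVert_{1}
\tag{12}
$$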

where $\left\lVert\cdot\right\rVert_{1}$ is the $\ell_{1}$-norm, which sums up the absolute values of all matrix elements, $\barwedge$ denotes the elementwise NAND operator of two matrices, and $h$ and $w$ are the height and the width of the input image, respectively. $\mathbf{D}[t]$ is the difference image between the obtained segmentation map $\mathbf{S}[t]$ ($\mathbf{S}\in\mathbb{N}^{h\times w}$, $\mathbf{S}_{ij}\in\left\{0,1,\dots,N-1\right\}$, where $N$ is the number of classes) and the corresponding ground-truth $\mathbf{G}[t]$ at time step $t$, and is calculated by

$$
\mathbf{D}[t] = (\mathbf{S}[t] \barwedge \mathbf{G}[t]) \circ \mathbf{S}[t]
\tag{13}
$$

The term $\mathbf{S}[t]\barwedge\mathbf{G}[t]$ is multiplied elementwise by the segmentation map $\mathbf{S}[t]$, since $\mathbf{S}[t]\barwedge\mathbf{G}[t]$ is only boolean; by means of this multiplication, the difference image also records which class was predicted at each mismatching pixel. Fig. 2 shows the flickering pixels of a short video sequence according to the mFP metric. The disadvantage of mFP is that it requires the ground-truth for all frames of the video sequence. However, semantic labeling of each frame is very time-intensive, so common datasets often provide the ground-truth for only a few frames of each video sequence. Therefore, a trimmed-down version of mFP is also proposed, the so-called mFIP (mean Flickering Image Pixels), which is independent of the ground-truth. Instead, it measures the mismatch between two neighboring segmentation maps directly. As a consequence, moving objects and the ego-motion of the self-driving car cause additional flickering, but it is assumed that this flickering is approximately constant for all evaluated approaches. The mFIP is determined by

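$$
mFIP = \frac{1}{h \cdot w} \cdot \sum_{t=1}^{T} \left\lVert \mathbf{S}[t] \barwedge \mathbf{S}[t-1] \right\rVert_{1}
\tag{14}
$$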

In Fig. 3, the flickering image pixels of a short video sequence are shown according to the mFIP metric. All in all, the lower the mFP and mFIP values are, the fewer flickering pixels exist in a video sequence. Note that mFP and mFIP do not state anything about the accuracy. For example, if each pixel of a video sequence is assigned the same class in every frame, mFIP is zero, but the segmentation accuracy is very poor.
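The two metrics can be sketched in NumPy as shown below. The sketch rests on one interpretive assumption that should be stated clearly: the elementwise NAND of two label maps is read as a mismatch indicator, i.e. it is 1 where the two class labels differ and 0 otherwise. The sum over frame pairs follows the formulas literally and is not divided by the number of frames; any scaling to per mille, as used in the result tables, is left out. All function names are hypothetical.

```python
import numpy as np


def differs(a, b):
    """Elementwise mismatch indicator (assumed reading of the NAND operator)."""
    return (a != b).astype(np.int64)


def mean_flickering_image_pixels(segmentations):
    """mFIP: label changes between neighboring frames, normalized by h * w."""
    h, w = segmentations[0].shape
    changes = sum(differs(cur, prev).sum()
                  for prev, cur in zip(segmentations[:-1], segmentations[1:]))
    return changes / (h * w)


def mean_flickering_pixels(segmentations, ground_truths):
    """mFP: like mFIP, but computed on the error images D[t] = (S[t] != G[t]) * S[t]."""
    error_images = [differs(s, g) * s for s, g in zip(segmentations, ground_truths)]
    return mean_flickering_image_pixels(error_images)


# A moving object that is always segmented perfectly: mFP is zero, mFIP is not.
gt_moving = [np.array([[1, 1], [0, 0]]), np.array([[0, 1], [1, 0]])]
mfp_moving = mean_flickering_pixels(gt_moving, gt_moving)
mfip_moving = mean_flickering_image_pixels(gt_moving)

# A static scene with a one-frame ghost pixel: both metrics register the flicker.
gt_static = [np.ones((2, 2), dtype=np.int64)] * 3
prediction = [gt_static[0], np.array([[2, 1], [1, 1]]), gt_static[2]]
mfp_ghost = mean_flickering_pixels(prediction, gt_static)

print(mfp_moving, mfip_moving, mfp_ghost)
```

The first example shows the motion compensation that distinguishes mFP from mFIP: label changes that follow the ground-truth do not count as flickering in mFP, while mFIP counts them.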

V Evaluation

In this section, the proposed convLSTM cells are compared with the standard convLSTM cells on the video segmentation task. First, the network architectures are described, and some information about the datasets and the training details is given. Then, the proposed convLSTM cells are evaluated by means of different evaluation metrics. Finally, their computational costs and parameter counts are discussed.

V-A Network Architectures

For the evaluation of the proposed convLSTM cells, three different versions of the LSTM-ICNet introduced in [14] are considered, and modified by means of the proposed acceleration techniques to reduce their parameter count and their computational costs. All three LSTM-ICNet versions are extended versions of the ICNet [24], where convLSTM layers are added at different positions. In LSTM-ICNet version 2, the convLSTM layer is placed before the softmax layer, and in version 5, a convLSTM layer is added at the end of each resolution branch of the ICNet. LSTM-ICNet version 6 is a combination of version 2 and 5, where the convLSTM layers are in front of the softmax layer and at the end of each resolution branch. For more details about the corresponding network architectures see [14]. The convLSTM cells of all considered versions are replaced by spatial-convLSTM cells, by depth-convLSTM cells and sep-convLSTM cells respectively, and in the following they are compared with the standard versions. Each of the convLSTM cells has a kernel size of $3$ and the same number of output channels as its previous layer.

V-B Datasets and Training Details

In the following, the Cityscapes dataset [4] and the virtual Kitti dataset [5] are used for evaluation. The Cityscapes dataset consists of $5000$ color images with corresponding fine-annotated semantic labels. Furthermore, the $19$ preceding images are also available for each image of the dataset. Similar to other works [13, 14, 24, 25], 19 of the 30 available classes are used for evaluation. The dataset contains 2975 images for training and 500 images for validation. The proposed approaches are evaluated on the validation set, since the ground-truth of the test images is not publicly available. The virtual Kitti dataset is a photo-realistic synthetic dataset for semantic segmentation containing 50 video sequences in different weather conditions (e.g. rain, fog, sunset), with 21260 images in total. For each video frame, the corresponding ground-truth is available. Analogously to [23], the dataset is split into two subsets: the first half of each video sequence is used for training and the second half for testing.

Throughout all of our experiments, video sequences of four frames are considered, i.e. the temporal image information of the last three frames $t-3$, $t-2$, and $t-1$ is used to determine the segmentation map of frame $t$. The training loss is the cross-entropy loss described in [24], but similarly to [14] it is computed only for the last frame of the video sequence, i.e. the results of the previous frames $t-3$, $t-2$, and $t-1$ are not considered in the loss calculation. The training loss is minimized by Stochastic Gradient Descent (SGD) with a weight decay of $0.0001$ and a momentum of $0.9$. Moreover, each model is trained for $30k$ iterations, and a poly learning rate policy is used, starting with an initial learning rate of $0.001$. All states of the convLSTM cells are initialized with zero, which corresponds to complete ignorance of the past, while the remaining parameters are initialized from the same pretrained network. During training, the video sequences are randomly scaled and flipped to avoid overfitting, and the batch size was set to one for Cityscapes and to two for virtual Kitti due to memory constraints. Note that our ICNet implementation differs slightly from the original implementation [24], because we skip the subsequent model compression. Instead, we train with half the feature size. Due to this and due to the small batch size, a mIoU of $60.2\%$ is achieved on Cityscapes (batch size 1), while the original implementation yields a mIoU of $67.6\%$ (batch size 16).
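As an illustration of the learning rate policy mentioned above, the following is a minimal sketch of a poly schedule. It assumes that the stated $30k$ refers to training iterations and that the decay exponent is $0.9$, a value commonly used with this policy; the exponent is not given in the text, and the function name `poly_learning_rate` is purely illustrative.

```python
def poly_learning_rate(step, base_lr=0.001, max_steps=30_000, power=0.9):
    """Poly policy: base_lr * (1 - step / max_steps) ** power.

    base_lr follows the text; max_steps assumes 30k iterations and
    power=0.9 is an assumed value, not one stated in the paper.
    """
    if not 0 <= step <= max_steps:
        raise ValueError("step must lie in [0, max_steps]")
    return base_lr * (1.0 - step / max_steps) ** power


if __name__ == "__main__":
    # Print the learning rate at the start, middle, and end of training.
    for step in (0, 15_000, 30_000):
        print(step, poly_learning_rate(step))
```

The schedule starts at the initial learning rate and decays smoothly to zero at the final step, so late updates change the pretrained weights only slightly.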

V-C Performance Evaluation

Now, the different LSTM-ICNet versions described in the previous sections are evaluated in terms of pixelwise accuracy (acc.), mIoU, and mFP/mFIP on the two datasets. Note that the number of flickering image pixels (mFIP) is not determined from the validation set of the Cityscapes dataset, since its video clips are too short. Instead, the mFIP values are calculated from the demo videos of this dataset. Table II and Table III contain the corresponding results for the Cityscapes dataset and the virtual Kitti dataset, respectively. For both datasets, the standard LSTM-ICNet versions outperform the original ICNet in terms of accuracy and mIoU. Furthermore, the number of flickering pixels is reduced significantly compared to the original ICNet. In the previous sections, different acceleration techniques for convLSTM cells were discussed. It turns out that the spatial-LSTM-ICNet versions achieve results similar to those of the corresponding LSTM-ICNet versions, and in some cases they even perform slightly better. In contrast, the depth-LSTM-ICNet and sep-LSTM-ICNet versions perform worse in terms of pixelwise accuracy and mIoU, but they still outperform the original ICNet in most instances. Moreover, according to their mFP/mFIP values, they produce far fewer flickering pixels in the video sequences than the original ICNet.
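For readers unfamiliar with the two accuracy metrics, the sketch below computes pixelwise accuracy and mIoU from a confusion matrix in plain Python. It uses the standard definitions (IoU per class as TP / (TP + FP + FN), averaged over classes that occur) and is not the authors' evaluation code. The ignore label of 255 and all function names are assumptions made for illustration.

```python
def confusion_matrix(labels, preds, num_classes, ignore_label=255):
    """Count (ground truth, prediction) pairs over flattened pixel lists."""
    conf = [[0] * num_classes for _ in range(num_classes)]
    for y, p in zip(labels, preds):
        if y == ignore_label:  # assumed convention for unlabeled pixels
            continue
        conf[y][p] += 1
    return conf


def pixel_accuracy(conf):
    """Fraction of evaluated pixels whose prediction matches the label."""
    correct = sum(conf[c][c] for c in range(len(conf)))
    total = sum(sum(row) for row in conf)
    return correct / total if total else 0.0


def mean_iou(conf):
    """Mean over classes of TP / (TP + FP + FN), skipping absent classes."""
    n = len(conf)
    ious = []
    for c in range(n):
        tp = conf[c][c]
        fn = sum(conf[c]) - tp
        fp = sum(conf[r][c] for r in range(n)) - tp
        denom = tp + fp + fn
        if denom:
            ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else 0.0


if __name__ == "__main__":
    labels = [0, 0, 1, 1, 2, 255]  # toy ground truth, last pixel ignored
    preds = [0, 1, 1, 1, 2, 0]     # toy prediction
    conf = confusion_matrix(labels, preds, num_classes=3)
    print(pixel_accuracy(conf), mean_iou(conf))
```

Because mIoU averages over classes rather than pixels, it penalizes errors on rare classes more strongly than pixelwise accuracy does, which is why the two metrics can rank models differently.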

V-D Computational Costs and Memory Consumption

In this section, the computational costs and the memory consumption of the proposed approaches are discussed. Each approach was implemented in TensorFlow [1] and executed on a single GPU (Nvidia TitanX) and on a CPU (Intel(R) Core(TM) i5-6300U CPU @ 2.40GHz). Table IV contains the corresponding results for an input image of size $1024\times 2048$. The experiments show that the number of FLOPs is reduced by the proposed spatial-convLSTMs, depth-convLSTMs, and sep-convLSTMs. As expected, the depth-convLSTM versions require the fewest FLOPs, since depthwise convolutions are computationally more efficient than spatially separable and depthwise separable convolutions. Furthermore, the percentage reduction of the FLOPs is greatest for the modified LSTM-ICNet version 6, because these versions contain the most convLSTM cells and hence benefit most from the proposed separable convLSTM cells. Due to the reduced number of FLOPs, the modified LSTM-ICNet versions take much less computation time on the CPU than their original versions, as the results in Table IV show. For example, the inference time of LSTM-ICNet version 6 is approximately halved if the convLSTM cells are replaced by depth-convLSTM cells or sep-convLSTM cells. The inference time is also reduced on the GPU, but not as much as on the CPU, and the spatial-convLSTM versions take even longer than the original LSTM-ICNet versions. The reason is that standard convolution operations are highly optimized in deep-learning frameworks, especially for $3\times 3$ kernels, so that for small kernel sizes the overhead of executing two convolutions on the GPU outweighs the speedup. For larger kernel sizes, the spatial-convLSTM cells are again faster than the conventional convLSTM cells. The spatially separable convolutions can certainly be implemented more efficiently by means of computational tricks such as [11] to reduce their execution time; however, this is beyond the scope of this work.
Additionally, the required memory decreases for spatial-LSTM-ICNet, depth-LSTM-ICNet and sep-LSTM-ICNet, since their parameter count is reduced by up to $41\%$ . The parameter saving is greatest for the depth-LSTM-ICNet versions, while it is lowest for the spatial-LSTM-ICNet versions.
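To make the FLOP counts more concrete, the following sketch evaluates the three terms of Table I for a standard convLSTM layer. The symbols are read as kernel size $K_x \times K_y$, input channels $I$, output channels $O$, and feature map size $D_x \times D_y$; this reading, the function name, and the layer dimensions in the usage example are assumptions chosen for illustration and are not taken from the evaluated networks.

```python
def conv_lstm_flops(k_x, k_y, in_ch, out_ch, d_x, d_y):
    """FLOPs of a standard convLSTM layer, following the rows of Table I."""
    pixels = d_x * d_y
    return {
        # 8 convolutions, each 2 * Kx * Ky * I * O FLOPs per output pixel
        "convolutions": 8 * 2 * k_x * k_y * in_ch * out_ch * pixels,
        # 3 Hadamard products over the O-channel feature map
        "hadamard": 3 * out_ch * pixels,
        # 3 sigma operations, counted as 5 FLOPs per element in Table I
        "sigma": 3 * 5 * out_ch * pixels,
    }


if __name__ == "__main__":
    # Hypothetical layer: 3x3 kernel, 64 input and output channels,
    # 32x64 feature map. These sizes are illustrative only.
    flops = conv_lstm_flops(3, 3, 64, 64, 32, 64)
    total = sum(flops.values())
    for name, value in flops.items():
        print(f"{name}: {value} ({100.0 * value / total:.2f}% of total)")
```

For layer sizes like these, the convolution term dominates the total by several orders of magnitude, which is consistent with the paper's focus on replacing the convolutions rather than the elementwise operations.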

VI Conclusion

In this paper, three different approaches to speed up standard convLSTM cells were investigated on the video segmentation task. It turned out that the spatial convLSTM cells achieve performance similar to that of the standard convLSTM approaches on well-known datasets, while being computationally more efficient and requiring fewer model parameters. The number of FLOPs and the parameter count can be reduced further by using depthwise or depthwise separable convLSTM cells instead, but at the expense of performance. Furthermore, the evaluation metric mean Flickering Pixels (mFP) was introduced, which measures the number of flickering pixels within a video sequence. Experiments with the proposed separable convLSTM cells show that the number of flickering pixels can be reduced significantly if temporal image information is also considered when segmenting video sequences by means of recurrent neural networks.

VII Acknowledgment

The research leading to these results has received funding from the European Union under the H2020 ECSEL Programme as part of the DENSE project, contract number 692449.

Bibliography

[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, et al. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.
[2] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
[3] François Chollet. Xception: Deep learning with depthwise separable convolutions. CoRR, abs/1610.02357, 2016.
[4] Marius Cordts, Mohamed Omran, Sebastian Ramos, Timo Rehfeld, Markus Enzweiler, Rodrigo Benenson, Uwe Franke, Stefan Roth, and Bernt Schiele. The cityscapes dataset for semantic urban scene understanding. CoRR, abs/1604.01685, 2016.
[5] A. Gaidon, Q. Wang, Y. Cabon, and E. Vig. Virtual worlds as proxy for multi-object tracking analysis. In CVPR, 2016.
[6] Juergen T. Geiger, Zixing Zhang, Felix Weninger, Björn Schuller, and Gerhard Rigoll. Robust speech recognition using long short-term memory recurrent neural networks for hybrid acoustic modelling.
[7] Alex Graves. Generating sequences with recurrent neural networks. CoRR, abs/1308.0850, 2013.
[8] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Comput., 9(9):1735–1780, November 1997.