Residual Pyramid Learning for Single-Shot Semantic Segmentation

Xiaoyu Chen; Xiaotian Lou; Lianfa Bai; Jing Han

arXiv:1903.09746·cs.CV·June 19, 2019

TL;DR

This paper introduces RPNet, a residual pyramid network for single-shot semantic segmentation that efficiently learns main and residual features to improve accuracy and detail recovery without complex decoders.

Contribution

The paper proposes a novel residual pyramid network that decomposes segmentation labels at multiple levels, enabling efficient single-shot segmentation with enhanced detail recovery.

Findings

01 Achieves high accuracy on CamVid and Cityscapes datasets.

02 Improves segmentation detail using residual features.

03 Offers high efficiency compared to state-of-the-art methods.

Tables

Table 1. TABLE I: Backbone network of the RPNet using the reproduced ENet encoder [25].

Scale	Name	Type	Output Channels	Level
1/2	Initial		16	Level-3 (Res)
1/2	Bottleneck1.0	Regular	16	Level-3 (Res)
1/4	Bottleneck2.0	Downsampling	64	Level-2 (Res)
	Bottleneck2.1	Regular	64
	Bottleneck2.2	Regular	64
	Bottleneck2.3	Regular	64
	Bottleneck2.4	Regular	64
1/8	Bottleneck3.0	Downsampling	128	Level-1 (Main)
	Bottleneck3.1	Regular	128
	Bottleneck3.2	Regular	128
	Bottleneck3.3	Asymmetric	128
	Bottleneck3.4	Regular	128
	Bottleneck3.5	Regular	128
	Bottleneck3.6	Regular	128
	Bottleneck3.7	Asymmetric	128
	Bottleneck3.8	Regular	128
	Repeat section 3, without Bottleneck3.0
	MainOut	Conv	classes

Table 2. TABLE II: Speed comparison of our method against the baseline at different input sizes on edge (TX2) and desktop (1080Ti) platforms.

Model	NVIDIA Jetson TX2						NVIDIA GTX 1080Ti
	480 $\times$ 320		640 $\times$ 360		1280 $\times$ 720		640 $\times$ 360		1280 $\times$ 720		1920 $\times$ 1080
	ms	fps	ms	fps	ms	fps	ms	fps	ms	fps	ms	fps
ENet	55	18	74	13	249	4	8.8	114	10	100	26.2	38
RPNet-sr-bp	47	22	60	16	200	5	6.5	154	6.7	149	14.3	70

Table 3. TABLE III: Speed and parameters analysis of the ENet, ENet encoder and ENet with RPNet at Level2(L2) and Level3(L3).

Method	FLOPs	Parameters	Mean IoU
ENet	3.00B	0.3890M	59.89
ENet encoder	2.62B	0.3507M	59.60
RPNet-sr-bp(L2)	2.64B	0.3514M	62.31
RPNet-sr-bp(L2,L3)	2.65B	0.3516M	63.29
RPNet-sr-gp	2.73B	0.3542M	63.90
RPNet-er-bp	2.65B	0.3516M	64.04
RPNet-er-gp	2.73B	0.3542M	64.67

Table 4. TABLE IV: Comparison of single training with weighted losses and level-wise training.

	EQ	LIN	POLY	NORM	Level-Wise
Mean IoU	60.64	60.48	60.70	60.98	63.29

Table 5. TABLE V: Performance and computation comparison on CamVid (480 $\times$ 360).

Method	Mean IoU	fps	Parameters	FLOPs
ENet [25]	59.89	111	0.39M	1.50B
ERFNet [27]	60.54	133	2.07M	8.43B
ESPNet [22]	62.6	205	0.68M	0.87B
FC-DenseNet56 [15]	58.9	27	1.5M	26.29B
RPNet(ENet)	64.67	102	0.35M	1.36B
RPNet(ERFNet)	64.82	149	1.89M	6.78B

Table 6. TABLE VI: Speed and accuracy comparison on Cityscapes.

Method	Input Size	Mean IoU	Mean iIoU	fps	FLOPs
ENet [25]	1024 $\times$ 512	58.3	34.4	77	4.03B
ERFNet [27]	1024 $\times$ 512	68.0	40.4	59	25.60B
ESPNet [22]	1024 $\times$ 512	60.3	31.8	139	3.19B
BiSeNet [31]	1536 $\times$ 768	68.4	-	69	26.37B
ICNet [34]	2048 $\times$ 1024	69.5	-	30	-
DeepLab(MobileNet) [7]	2048 $\times$ 1024	70.71 (val)	-	16	21.27B
LRR [11]	2048 $\times$ 1024	69.7	48.0	2	-
RefineNet [20]	2048 $\times$ 1024	73.6	47.2	-	263B
RPNet(ENet)	1024 $\times$ 512	63.37	39.0	88	4.28B
RPNet(ERFNet)	1024 $\times$ 512	67.9	44.9	123	20.71B

Equations

$$\begin{aligned} \bm{identity}_{l+1} &= \bm{identity}_{l} + \bm{res}_{l} \\ \Rightarrow\ \bm{identity}_{l} &= \bm{identity}_{l+1} + (-\bm{res}_{l}) \end{aligned} \tag{1}$$

$$\bm{p}_{l} = \bm{p}_{l+1} + \bm{pres}_{l} \tag{2}$$

$$\bm{y}_{l} = h(\bm{x}_{l}) + \mathcal{F}(\bm{x}_{l}, \mathcal{W}_{l}) \tag{3}$$

$$\bm{x}_{l+1} = f(\bm{y}_{l}) \tag{4}$$

$$\bm{x}_{L} = \bm{x}_{l} + \sum_{i=l}^{L-1} \mathcal{F}(\bm{x}_{i}, \mathcal{W}_{i}) \tag{5}$$

$$\bm{res} = \bm{x}_{L} - \bm{x}_{l} = \sum_{i=l}^{L-1} \mathcal{F}(\bm{x}_{i}, \mathcal{W}_{i}) \tag{6}$$

$$\bm{x}_{l} = \bm{x}_{L} + (-\bm{res}) \tag{7}$$

$$l_{i} = criterion(\bm{target}_{i}, \bm{label}_{i}), \qquad \bm{target}_{i} = \bm{target}_{1} + \sum_{k=2}^{i} \bm{tres}_{k} \quad (i \geq 2) \tag{8}$$

$$\bm{label}_{i} = \bm{label}_{1} + \sum_{k=2}^{i} \bm{lres}_{k} \quad (i \geq 2) \tag{9}$$

$$\min\ criterion\Big(\bm{target}_{1} + \sum_{k=2}^{i} \bm{tres}_{k},\ \bm{label}_{1} + \sum_{k=2}^{i} \bm{lres}_{k}\Big) \tag{10}$$

$$\min\ criterion(\bm{tres}_{i}, \bm{lres}_{i}) \tag{11}$$

$$loss = \sum loss_{i} \tag{12}$$

$$Loss_{k} = \sum_{i=1}^{k} loss_{i} \tag{13}$$

$$\left(1 - \frac{iter}{max\_iter}\right)^{power} \tag{14}$$

$$W_{class} = \frac{1}{\log(P_{class} + k)} \tag{15}$$

Code & Models

Repositories

superlxt/RPNet-Pytorch
pytorch

Full text

Residual Pyramid Learning for Single-Shot Semantic Segmentation

Xiaoyu Chen, Xiaotian Lou, Lianfa Bai, and Jing Han. The authors are with the School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing, Jiangsu, 210091 China.

Abstract

Pixel-level semantic segmentation is a challenging task with a huge amount of computation, especially if the size of input is large. In the segmentation model, apart from the feature extraction, the extra decoder structure is often employed to recover spatial information. In this paper, we put forward a method for single-shot segmentation in a feature residual pyramid network (RPNet), which learns the main and residuals of segmentations by decomposing the label at different levels of residual blocks. Specifically speaking, we use the residual features to learn the edges and details, and the identity features to learn the main part of targets. At testing time, the predicted residuals are used to enhance the details of the top-level prediction. Residual learning blocks split the network into several shallow sub-networks which facilitates the training of the RPNet. We then evaluate the proposed method and compare it with recent state-of-the-art methods on CamVid and Cityscapes. The proposed single-shot segmentation based on RPNet achieves impressive results with high efficiency on pixel-level segmentation.

Index Terms:

Intelligent vehicles, real-time vision, scene understanding, residual learning.

I Introduction

Autonomous driving technology has been developing rapidly to increase transportation efficiency and improve the driving experience. Scene parsing from images is receiving more and more attention in autonomous driving systems, as cameras are the most accessible and inexpensive sensors. With recent advances in deep neural networks (DNNs) [19], the performance of semantic segmentation has improved greatly. However, balancing this performance against the huge computation cost that deep neural networks bring has become a new problem in real-time systems.

Neural networks are often constructed in the shape of a pyramid to increase translation invariance and reduce computation. Recently, researchers have used the hidden features of pyramid networks to produce more powerful features for more complex tasks such as super resolution and semantic segmentation, because spatial details decrease from the bottom to the top of the network [2, 20, 30].

Semantic segmentation assigns a class to every pixel of the input, so it is essential to construct feature maps that cover all pixels. The popular encoder-decoder models use a pyramid network as an encoder to approximate a low-resolution target, and an inverted pyramid network following the encoder to reconstruct the high-resolution target; unpooling [32] and deconvolution [24] are often used in such decoders. Because details are lost during encoding, the decoder's ability to recover them is limited, and it entails much extra computation. To balance efficiency and reliability, some methods remove the decoder and interpolate the low-resolution result directly, at the expense of details [31, 5].

Different levels of features in neural networks represent different levels of semantics. High-level features have strong semantics, while low-level features have rich spatial details. Therefore, fusing features from different levels is often adopted in decoder structures.

Although some methods use low-level features as supplements for prediction, simply integrating low-level and high-level features accumulates redundant information [28, 11]. Since the different levels of features are mutually complementary, we attempt to use them to approximate the target separately and combine their outputs to get the full target.

ResNet [12] is an excellent example of feature separation. It was proposed to ease the training of deep networks by adding an identity mapping, which separates the feature into a residual part and an identity part in one block:

$$\begin{aligned} \bm{identity}_{l+1} &= \bm{identity}_{l} + \bm{res}_{l} \\ \Rightarrow\ \bm{identity}_{l} &= \bm{identity}_{l+1} + (-\bm{res}_{l}) \end{aligned} \tag{1}$$

where $\bm{identity}_{l+1}$ and ( $-\bm{res}_{l}$ ) can be seen as independent parts of $\bm{identity}_{l}$ . Different from the feature pyramid in plain networks, the features of ResNet form a feature residual pyramid [13], which is similar to the process of the Laplacian residual pyramid [4]:

$$\bm{p}_{l} = \bm{p}_{l+1} + \bm{pres}_{l} \tag{2}$$

where $\bm{identity}_{l+1}$ and $\bm{p}_{l+1}$ respectively represent the identity features in ResBlocks and the Laplacian pyramid images with higher resolution at the $(l+1)$ -th level, $\bm{identity}_{l}$ and $\bm{p}_{l}$ respectively represent those with lower resolution, and $\bm{res}_{l}$ and $\bm{pres}_{l}$ represent the residuals. For simplicity, upsampling is omitted before the pixel-wise sum in Eqn.(1) and Eqn.(2).
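
The Laplacian-style decomposition behind Eqn.(2) can be illustrated with a minimal Python sketch on a one-dimensional signal. It is not taken from the paper: the helper names, the pair-averaging reduction and the nearest-neighbour upsampling are assumptions chosen for brevity. It only shows that a coarse signal plus the stored residuals reconstructs the original exactly.

```python
def downsample(signal):
    """Average each pair of neighbours (assumes an even length)."""
    return [(signal[i] + signal[i + 1]) / 2 for i in range(0, len(signal), 2)]


def upsample(signal):
    """Nearest-neighbour upsampling by a factor of 2."""
    return [value for value in signal for _ in range(2)]


def build_residual_pyramid(signal, levels):
    """Return the coarsest signal and the residuals, ordered fine to coarse."""
    residuals = []
    current = signal
    for _ in range(levels):
        coarse = downsample(current)
        # Residual = detail that the coarse level cannot represent.
        residuals.append([c - u for c, u in zip(current, upsample(coarse))])
        current = coarse
    return current, residuals


def reconstruct(coarse, residuals):
    """Add the residuals back, from the coarsest level to the finest."""
    current = coarse
    for residual in reversed(residuals):
        current = [u + r for u, r in zip(upsample(current), residual)]
    return current


signal = [1.0, 3.0, 2.0, 6.0, 5.0, 5.0, 0.0, 4.0]
coarse, residuals = build_residual_pyramid(signal, levels=2)
restored = reconstruct(coarse, residuals)
```

In this toy setting the coarse signal plays the role of the main part and each residual list plays the role of one level of details.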

In general, the low-level features in the network, with a small receptive field, contain local texture information and focus on the details and edges of instances, while the high-level features, with a large receptive field, focus on the overall attributes of bigger blobs [32]. In ResNet-like networks, hidden features of ResBlocks are further designed as residual features ( $\bm{res}_{l}$ ) that represent residual information.

From Eqn.(1) we can see that ( $-\bm{res}_{l}$ ) is lost in the feed-forward pass along with the reduction of spatial details. Given the characteristics of the features in different ResBlocks of the network, we can use ( $-\bm{res}_{l}$ ) to retrieve the lost details ( $\bm{pres}_{l}$ ).

For this purpose, we decompose the segmentation target to train the residual pyramid network: at training time, the details and the main part of the target are separated and learned with the residual features and the high-level identity features, respectively. At testing time, the target residual pyramid approximated by these features is reconstructed into the full target.

In the overall forward pass of the network, the input is separated into multiple independent residual branches and a main low-frequency branch; we then collect these branches and use them to approximate different parts of the target. By shortening the paths of gradient propagation in the residual branches, the residual learning blocks also facilitate training. Besides, the input only passes through a single pyramid network without a decoding process, which reduces computation and increases efficiency significantly.

The concept of single-shot was first proposed in object detection to describe detectors built on a single neural network [21, 26]. We borrow this concept for our single residual pyramid network to distinguish it from encoder-decoder methods. In object detection, a single-shot detector is often more efficient than models with multiple networks. The single-shot detector SSD, one of the fastest detectors, uses a single backbone network to obtain different levels of features and generates boxes of different scales on these features for multi-scale detection, saving much computation and reducing latency. To enhance the efficiency of segmentation, we use this idea to approximate different levels of target residuals at different layers of the backbone network. The single-shot RPNet only requires several simple convolution layers in addition to the original pyramid network.

To improve the efficiency of perceiving the road environment in intelligent vehicles, this paper puts forward a residual pyramid learning network (RPNet) that learns the residual pyramid of the target and implements single-shot segmentation. We design a loss function to train the residual features toward the residual targets. Instead of decomposing labels directly, the key to training an RPNet is to reconstruct the full predicted targets at different levels and compute the loss against the full labels at the corresponding scales of the residual pyramid. We implement the RPNet on different backbone networks to evaluate the proposed method.

This paper makes the following contributions:

*(i)* We propose a novel single-shot structure for segmentation that predicts different parts of the target in a single pyramid feed-forward pass, which improves efficiency significantly.

*(ii)* We use residual features to predict multi-level residuals of the target by designing a residual loss function, and conduct top-down level-wise training. The performance on small objects and details improves noticeably.

*(iii)* We design variations of residuals and predictors based on RPNet, and extend the approach to arbitrary structures beyond ResNet-like networks.

*(iv)* We set up RPNet on existing high-speed segmentation networks and achieve improvements in both accuracy and efficiency.

II Related Work

Applying semantic segmentation in autonomous driving systems requires high accuracy as well as low latency to ensure driving safety. However, computing resources are limited on intelligent vehicles, so improving accuracy with limited computation and maintaining accuracy while reducing computation are two important directions.

II-A Multi-Path and Single-Shot structure

As repeated downsampling operators in pyramid neural networks lead to a significant decrease in image resolution, many methods have been proposed to recover the details from low-level features.

The encoder-decoder structure is often used to recover spatial details in semantic segmentation. The U-type structure, proposed in U-Net [28], uses multiple paths to let low-level features skip the middle layers and be combined with the refined high-level features in the decoder, which enhances the performance on details. FC-DenseNet [15] extends DenseNets [14] to a U-type structure and improves the upsampling path in the decoder to reduce computation.

Besides the U-type structure, multiple feature fusion is also commonly used to recover details. RefineNet [20] proposes a generic multi-path refinement network that fuses multi-resolution features from different layers to generate high-resolution, high-quality results. However, many convolutions are added before the multi-resolution fusion, and the speed on 512 $\times$ 512 images is only 20 fps. Light-Weight RefineNet [23] increases the speed by modifying the RCU and CRP modules of the original RefineNet, and operates at 55 fps.

BiSeNet [31] uses a bilateral segmentation network with a Spatial Path and a Context Path to achieve both rich spatial information and a sizeable receptive field. To improve efficiency, BiSeNet uses an interpolation decoder. ICNet [34] sets up an image cascade network with multi-resolution branches under proper label guidance to reduce computation in pixel-level segmentation inference.

The multi-path fusion strategy often uses the full feature of each layer, so there is much information and computing redundancy.

A decoder consisting of a single interpolation is often adopted to avoid computing redundancy, as in DeepLabV3 [6] and ESPNet [22]. Here we regard methods with an interpolation decoder as one kind of single-shot structure. However, this single-shot structure can result in the loss of details, and post-processing with a conditional random field (CRF) [17] is often used to refine the coarse segmentation, which adds extra latency.

II-B Residual learning for pixel-level approximation

Residual learning was first proposed in classification networks [12] to address the network degradation and vanishing gradient problems. With its many advantages in network training, the concept of residual learning has been migrated to other tasks, such as super resolution and semantic segmentation.

Super resolution can be considered a process of detail restoration, so learning sparse residuals is more efficient than learning the full higher-resolution image itself. VDSR [16] sets up a deep neural network to learn the high-frequency details, which achieves a great improvement in accuracy. LapSRN [18] extends the concept to cascaded residual learning blocks, progressively reconstructs the sub-band residuals of high-resolution images at multiple pyramid levels, and further increases the performance.

In semantic segmentation, there have also been many attempts at residual learning. LRR [11] uses low-resolution results to generate a boundary mask for the high-resolution feature; the high-resolution boundary is then predicted to refine the result. Global-Local-Refinement [33] iteratively predicts global residuals using an input that concatenates the original image and the confidence map.

All these methods have proved that residual learning can not only ease the optimization of network, but also improve the efficiency of the approximation.

II-C Optimizing networks with multiple loss layers

Many networks have multiple loss layers for multiple tasks or stronger supervision. Weighted sum of the losses is often used for training in these networks.

PSPNet [35] and BiSeNet [6, 31] use intermediate predictors to construct an auxiliary loss layer after a hidden layer. The auxiliary loss, illustrated in [35], helps the optimization during learning, while the master branch loss takes the most responsibility. An ablation study in [35] is also presented to help decide which weight values to use. Impatient DNNs [1] use multiple early prediction layers to deal with dynamic time budgets during application, where the intermediate predictors are learned jointly with weighted losses. In [1], the authors compare different weights per loss component and choose the best weights to train the network.

In some specifically designed networks, joint training of multiple losses breaks the structure of the learned features. In LRR [11], the coarse and fine semantic segmentations are predicted from top to bottom in the network, where the fine predictions depend on the coarser predictions above them. Level-wise training is therefore necessary in LRR.

II-D Improvement for Real-time Segmentation

Real-time segmentation models, such as ENet [25], prefer thin networks with fewer filters so that the computing cost can be reduced. However, simply reducing computation leads to degraded performance.

In segmentation, some methods such as DeepLab v3 and BiSeNet [6, 31] remove the decoder structure to reduce computing cost, at the expense of spatial details.

Another way is to optimize convolution blocks. ERFNet [27] uses residual connections and factorized convolutions [29] to maintain efficiency and accuracy. DeepLab v3+ [7] proposes atrous separable convolution to speed up standard convolution. ESPNet [22] decomposes a standard convolution into a point-wise convolution and a spatial pyramid of dilated convolutions to improve efficiency. BiSeNet [6, 31] uses a shallow Spatial Path and a Context Path to generate high-resolution features and a sufficient receptive field, and then combines the two paths to predict the target. This two-path structure distributes the computation of depth over two sub-networks and improves their parallelism.

Therefore, smart operators and improved structures are needed in real-time segmentation, with computing complexity and resource utilization taken into consideration.

III Residual Pyramid Networks

The basic architecture of the proposed Residual Pyramid Network (RPNet) is based on a ResNet-like backbone network, as shown in Fig. 1. More complicated structures are discussed in the following sections. An example backbone built on the ENet [25] encoder is shown in Table.(I); it consists of three downsampling blocks, three residual learning blocks and the main learning block for segmentation. In RPNet, the features of the ResBlocks at a given level are summed and passed through a predictor to compute residuals, while the top output of the backbone is used to predict the small-scale segmentation. Finally, the predicted residual pyramid is reconstructed to obtain the full segmentation.

III-A Construct a Residual Learning Block

In ResNet, stacked ResBlocks can be expressed as:

$$\bm{y}_{l} = h(\bm{x}_{l}) + \mathcal{F}(\bm{x}_{l}, \mathcal{W}_{l}) \tag{3}$$

$$\bm{x}_{l+1} = f(\bm{y}_{l}) \tag{4}$$

where $\bm{x}_{l}$ and $\bm{x}_{l+1}$ are the input and output of the $l$ -th block, $\mathcal{F}$ is a residual function with the parameters $\mathcal{W}_{l}$ , $h(\bm{x}_{l})$ is an identity mapping and $f$ is an activation. Ignoring the activation $f$ , we can recursively substitute Eqn.(4) into Eqn.(3) and obtain the output $\bm{x}_{L}$ :

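$$\bm{x}_{L} = \bm{x}_{l} + \sum_{i=l}^{L-1} \mathcal{F}(\bm{x}_{i}, \mathcal{W}_{i}) \tag{5}$$

In Eqn.(5), the residual function $\mathcal{F}$ is the part to be trained, and the residual feature can be defined as:

$$\bm{res} = \bm{x}_{L} - \bm{x}_{l} = \sum_{i=l}^{L-1} \mathcal{F}(\bm{x}_{i}, \mathcal{W}_{i}) \tag{6}$$

This gives the residual feature of a single level; across the network we obtain multiple levels of residual features.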

We regard $(-\bm{res})$ as one stream carrying the high-frequency information of the input. It is subtracted from the input feature $\bm{x}_{l}$ to get the other stream, the main part $\bm{x}_{L}$ , which retains more low-frequency information:

$$\bm{x}_{l} = \bm{x}_{L} + (-\bm{res}) \tag{7}$$
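
The relation between the stacked residual functions and the residual feature $\bm{res}$ can be checked numerically. The sketch below is a toy illustration, not the paper's code: each residual function is a hypothetical affine map standing in for the convolution layers, and the activation $f$ is ignored as in the derivation above.

```python
def make_block(scale, shift):
    """Toy residual function F(x) = scale * x + shift (hypothetical stand-in)."""
    return lambda x: [scale * value + shift for value in x]


def forward(x, blocks):
    """Apply x_{i+1} = x_i + F_i(x_i) and keep every branch output F_i(x_i)."""
    branch_outputs = []
    for block in blocks:
        f_x = block(x)
        branch_outputs.append(f_x)
        x = [a + b for a, b in zip(x, f_x)]
    return x, branch_outputs


blocks = [make_block(0.5, 1.0), make_block(-0.25, 0.0), make_block(1.0, -2.0)]
x_l = [4.0, 8.0]
x_L, branch_outputs = forward(x_l, blocks)

# Residual feature: sum of all branch outputs between level l and level L.
res = [sum(values) for values in zip(*branch_outputs)]
difference = [top - bottom for top, bottom in zip(x_L, x_l)]
```

Here `res` and `difference` hold the same values, which is the statement that the summed branch outputs equal $\bm{x}_{L} - \bm{x}_{l}$.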

Generally, the $\bm{res}$ of the residual block in Eqn.(6) can be extended to network structures beyond ResNet, and more residuals can be extracted, as shown in Fig.(2). In plain networks, the residual features can be computed as the difference between high-level and low-level features. The pooling layer also loses many details, so in the extended residual structure we upsample the feature after the pooling layer and compute residual features against the feature before the pooling layer. In this way, all information throughout the network can be used for segmentation.
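
A rough sketch of the extended residual around a pooling layer follows. The 2 $\times$ 2 max pooling, the nearest-neighbour upsampling and the sign convention (deeper feature minus shallower feature, mirroring Eqn.(6)) are assumptions made for illustration; the paper does not specify them at this level of detail.

```python
def max_pool_2x2(feature):
    """2x2 max pooling with stride 2 on a 2-D list (even height and width)."""
    rows, cols = len(feature), len(feature[0])
    return [
        [
            max(feature[r][c], feature[r][c + 1],
                feature[r + 1][c], feature[r + 1][c + 1])
            for c in range(0, cols, 2)
        ]
        for r in range(0, rows, 2)
    ]


def upsample_2x(feature):
    """Nearest-neighbour upsampling by a factor of 2 in both directions."""
    out = []
    for row in feature:
        wide = [value for value in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def pooling_residual(before):
    """Return the pooled feature and res = upsample(pooled) - before."""
    pooled = max_pool_2x2(before)
    up = upsample_2x(pooled)
    res = [[u - b for u, b in zip(up_row, b_row)]
           for up_row, b_row in zip(up, before)]
    return pooled, res


before = [[1, 2, 0, 1],
          [3, 4, 1, 0],
          [5, 1, 2, 2],
          [0, 1, 3, 9]]
pooled, res = pooling_residual(before)

# The pre-pooling feature is recovered as upsample(pooled) + (-res).
recovered = [[u - r for u, r in zip(up_row, res_row)]
             for up_row, res_row in zip(upsample_2x(pooled), res)]
```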

Then residuals of target pyramid can be predicted with the $(-\bm{res})$ features using the appended predictors. Compared with other feature reconstruction methods, $(-\bm{res})$ has little information overlapping with the main part of $\bm{x}_{L}$ .

III-B Loss Function for Training

As directly decomposing the labels into a residual pyramid causes class conflicts and leads to mismatches, we train on the reconstructed predicted target and let the network learn the residuals indirectly.

Therefore, we reconstruct the predicted target from top to bottom in the training phase and use the reconstructed targets of different levels to compute losses against the scaled labels:

$$l_{i} = criterion(\bm{target}_{i}, \bm{label}_{i}), \qquad \bm{target}_{i} = \bm{target}_{1} + \sum_{k=2}^{i} \bm{tres}_{k} \quad (i \geq 2) \tag{8}$$

where $\bm{tres}_{i}$ and $\bm{target}_{i}$ are the predicted residuals and reconstructed targets at the $i$ -th level, $\bm{label}_{i}$ represents the ground truth label scaled to the $i$ -th level using nearest neighbor interpolation, and $criterion()$ is the Cross Entropy Loss. $\bm{target}_{i}$ must be upsampled, using bilinear interpolation, before it is added to $\bm{tres}_{i+1}$ . In the same way, the residual pyramid of the label can be defined as:

$$\bm{label}_{i} = \bm{label}_{1} + \sum_{k=2}^{i} \bm{lres}_{k} \quad (i \geq 2) \tag{9}$$

where $\bm{lres}_{i}$ is the label residual of the $i$ -th level. Then the learning process can be described as:

$$\min\ criterion\Big(\bm{target}_{1} + \sum_{k=2}^{i} \bm{tres}_{k},\ \bm{label}_{1} + \sum_{k=2}^{i} \bm{lres}_{k}\Big) \tag{10}$$

Recursively, Eqn.(10) is equal to

$$\min\ criterion(\bm{tres}_{i}, \bm{lres}_{i}) \tag{11}$$

which means that after being reconstructed to learn the scaled labels, the residual features are finally transformed into the label residuals. The sum of residual pyramid losses is the final loss:

$$loss = \sum loss_{i} \tag{12}$$

which indicates that we can train the losses separately for different levels of residuals.

There are multiple sub-networks of different depths in RPNet, and the multiple sub-losses help the propagation of gradients in the backward pass, thereby facilitating the training process.
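
The top-down reconstruction and the per-level losses described in this section can be sketched in plain Python on a one-dimensional, two-class toy example. Everything concrete here is an assumption for illustration: the function names, the logits, and the use of nearest-neighbour upsampling in place of the bilinear interpolation used in the paper.

```python
import math


def upsample_nearest(logits):
    """Repeat every pixel twice (stand-in for bilinear interpolation)."""
    return [list(pixel) for pixel in logits for _ in range(2)]


def cross_entropy(logits, labels):
    """Mean softmax cross entropy over pixels; labels are class indices."""
    total = 0.0
    for pixel, label in zip(logits, labels):
        peak = max(pixel)
        log_z = peak + math.log(sum(math.exp(v - peak) for v in pixel))
        total += log_z - pixel[label]
    return total / len(labels)


def pyramid_losses(main_logits, residual_logits, labels_per_level):
    """Rebuild the target level by level and score it against scaled labels."""
    target = main_logits
    losses = [cross_entropy(target, labels_per_level[0])]
    for tres, labels in zip(residual_logits, labels_per_level[1:]):
        upsampled = upsample_nearest(target)
        # Reconstructed target = upsampled coarser target + predicted residual.
        target = [[u + r for u, r in zip(up_px, res_px)]
                  for up_px, res_px in zip(upsampled, tres)]
        losses.append(cross_entropy(target, labels))
    return losses, target


full_labels = [0, 1, 1, 1]
coarse_labels = full_labels[::2]  # nearest-neighbour downscaling of the label
main_logits = [[2.0, 0.0], [0.0, 2.0]]
tres_2 = [[0.0, 0.0], [-3.0, 3.0], [0.0, 0.0], [0.0, 0.0]]

losses, target_2 = pyramid_losses(main_logits, [tres_2],
                                  [coarse_labels, full_labels])
coarse_only = cross_entropy(upsample_nearest(main_logits), full_labels)
predicted = [pixel.index(max(pixel)) for pixel in target_2]
```

In this example the upsampled main prediction mislabels the second pixel, and the residual corrects it, so the loss of the reconstructed target is lower than the loss of the interpolated main prediction alone.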

III-C The residual predictors

The residual predictors, which follow the residual features, are used to predict the residuals at different levels. We design two kinds of predictors, shown in Fig.(2): a basic type and a guided type. The basic predictor simply uses 1 $\times$ 1 convolutions to predict the main part and the residuals. The guided type uses the main-part feature of the previous level to guide the prediction of residuals and uses several 1 $\times$ 1 convolutions to adjust the channels. Both predictors follow the idea of Section.(III-B) and predict residuals indirectly.

As the training process proceeds step by step, by the time one level of residuals is trained, the main-part feature has already been trained to recognize pixels at the higher level. We upsample the main feature and concatenate it with the larger residual features; the combined feature is then easier to train and leads to a better result. More details can be found in Section.(IV-B).
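
As a loose illustration of the two predictor types, the sketch below treats a 1 $\times$ 1 convolution as a per-pixel linear map on a one-dimensional feature. The weights, channel counts and the single-convolution guided predictor are hypothetical simplifications; the actual layer layout is the one given in Fig.(2) of the paper.

```python
def conv1x1(feature, weight):
    """Per-pixel linear map: feature is [pixels][C_in], weight is [C_out][C_in]."""
    return [[sum(w * v for w, v in zip(row, pixel)) for row in weight]
            for pixel in feature]


def basic_predictor(res_feature, weight):
    """Basic type: a 1x1 convolution applied to the residual feature."""
    return conv1x1(res_feature, weight)


def guided_predictor(res_feature, main_feature, weight):
    """Guided type (simplified): upsample the main feature, concatenate it
    with the residual feature along channels, then apply a 1x1 convolution."""
    up_main = [list(pixel) for pixel in main_feature for _ in range(2)]
    fused = [res_px + main_px for res_px, main_px in zip(res_feature, up_main)]
    return conv1x1(fused, weight)


res_feature = [[1, 0], [0, 1], [1, 1], [0, 0]]   # 4 pixels, 2 channels
main_feature = [[2], [3]]                        # 2 pixels, 1 channel
w_basic = [[1, -1], [0, 2]]                      # 2 classes from 2 channels
w_guided = [[1, 0, 1], [0, 1, -1]]               # 2 classes from 3 channels

basic_out = basic_predictor(res_feature, w_basic)
guided_out = guided_predictor(res_feature, main_feature, w_guided)
```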

III-D Process of Training an RPNet

As the residual target is the difference between the full target at this level and the main output of the previous level, we should train the ResBlocks of different levels from top to bottom, step by step, to ensure that the residual feature focuses on the residual target. In effect, when we train the residual features at the last level, the main feature has already been well trained on the main target. If both residual and main features are trained from scratch at the same time, the targets for the residual and main features become ambiguous and the network does not converge to the optimal solution. The evaluation results in Table.(IV) also support the validity of our training method.

The loss function for the $k$ -th level is

$$Loss_{k} = \sum_{i=1}^{k} loss_{i} \tag{13}$$

Take ENet + RPNet as an example. The backbone network is shown in Table.(I). The Initial and Bottleneck blocks are the same as those proposed in [25], and we reproduce ENet with the new Downsampling block shown in Fig.(3).

First, train the main prediction at 1/8 scale of the original inputs; second, upsample the main prediction using bilinear interpolation; next, compute the pixel-wise sum of the upsampled main prediction and the residual predicted from Res2, and use it to train the prediction at 1/4 scale of the original inputs; then proceed in the same way for the 1/2 scale. Finally, we obtain a residual pyramid and, to save computing resources, directly upsample the 1/2-scale predicted result to get the full-scale segmentation. The training process is visualized in Fig.(4).
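
The level-wise schedule can be summarized in a few lines of Python. This is a sketch under assumptions: the function names and loss values are hypothetical, and the paper does not say whether earlier levels are frozen or keep training at later stages; the cumulative loss used here simply sums the per-level losses up to the current level.

```python
def training_schedule(num_levels):
    """Stage k uses the losses of levels 1..k (level 1 is the main output)."""
    return [list(range(1, k + 1)) for k in range(1, num_levels + 1)]


def stage_loss(level_losses, k):
    """Cumulative loss of stage k; level_losses[0] holds the level-1 loss."""
    return sum(level_losses[:k])


# Output scale of each level for the ENet example in the text.
scales = {1: "1/8", 2: "1/4", 3: "1/2"}
schedule = training_schedule(3)

example_losses = [0.5, 0.25, 0.125]  # hypothetical per-level loss values
for stage, levels in enumerate(schedule, start=1):
    active = ", ".join(scales[level] for level in levels)
    print(f"stage {stage}: scales {active}, "
          f"loss = {stage_loss(example_losses, stage)}")
```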

IV Experiments

A set of experiments is conducted to verify the effectiveness and efficiency of the proposed method. All experiments are evaluated on an NVIDIA GTX 1080Ti and an NVIDIA Jetson TX2. The main datasets in the experiments are CamVid [3] and Cityscapes [9].

CamVid. CamVid is a street scene dataset captured from the perspective of a driving automobile. It consists of 701 images at a resolution of 480 $\times$ 360, with 367 images for training and 233 for testing, and includes 11 classes.

Cityscapes. Cityscapes is also a street scene dataset, with 5000 finely annotated images (2975 for training, 500 for validation and 1525 for testing) at a resolution of 2048 $\times$ 1024, including 19 classes. We only use the finely annotated images for training, downsample the original images to 1024 $\times$ 512 for training, and interpolate the outputs to 2048 $\times$ 1024 for testing.

IV-A Implementation

Network

To show the performance of RPNet, we select ENet [25] and ERFNet [27] as our baseline models. ENet is one of the fastest models on the Cityscapes benchmark, and ERFNet is slightly slower but more accurate. Both models have ResBlocks in their encoders. The decoders of the two models are removed and the encoders are rebuilt as RPNets. As a comparative model, we also set up an ENet encoder whose output is bilinearly interpolated directly to the original size. To construct the level-3 residual, we add a regular Bottleneck1.0 to the original ENet encoder.

Training details

We use Adam optimization with decay 0.0001 and batch size 3. We apply the poly learning rate policy, in which the learning rate is multiplied by:

$$\left(1 - \frac{iter}{max\_iter}\right)^{power} \tag{14}$$

with power 0.9 and initial learning rate 0.0005. The class weighting scheme is:

$$W_{class} = \frac{1}{\log(P_{class} + k)} \tag{15}$$

where $k$ is set to 1.12.
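
The learning rate policy and the class weighting scheme translate directly into code. In the sketch below, the function names are hypothetical, the natural logarithm is assumed, and $P_{class}$ is assumed to be the fraction of pixels belonging to a class; the pixel counts are made-up values used only to show the effect.

```python
import math


def poly_lr(base_lr, iteration, max_iter, power=0.9):
    """Poly policy: base_lr * (1 - iter / max_iter) ** power."""
    return base_lr * (1 - iteration / max_iter) ** power


def class_weights(pixel_counts, k=1.12):
    """W_class = 1 / log(P_class + k), with P_class taken as pixel frequency."""
    total = sum(pixel_counts)
    return [1.0 / math.log(count / total + k) for count in pixel_counts]


start_lr = poly_lr(0.0005, iteration=0, max_iter=1000)
mid_lr = poly_lr(0.0005, iteration=500, max_iter=1000)
end_lr = poly_lr(0.0005, iteration=1000, max_iter=1000)

# Hypothetical pixel counts: a frequent class and a rare class.
weights = class_weights([900, 100])
```

With these assumptions a rare class receives a larger weight than a frequent one, and every weight stays between $1/\log(1 + k)$ and $1/\log(k)$.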

Data augmentation

We use random horizontal flips and translations of 0 to 2 pixels on both axes of the input images to augment the dataset at training time.

IV-B Ablation Study

We reproduce ENet [25] and compare the different settings: full ENet, ENet encoder with interpolation and ENet encoder with RPNet. The RPNet is trained in a level-wise way. We evaluate the methods on the CamVid test dataset with PASCAL VOC intersection-over-union metric (IoU) [10]. The result is shown in Table.(III).
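
For reference, mean IoU can be computed as in the minimal sketch below. It is a simplified illustration rather than the official evaluation script: it works on flat lists of class indices, does not handle a void or ignore label, and skips classes that appear in neither prediction nor ground truth. The example inputs are made up.

```python
def mean_iou(predictions, labels, num_classes):
    """Average of per-class TP / (TP + FP + FN) over classes that occur."""
    ious = []
    for cls in range(num_classes):
        tp = sum(1 for p, t in zip(predictions, labels) if p == cls and t == cls)
        fp = sum(1 for p, t in zip(predictions, labels) if p == cls and t != cls)
        fn = sum(1 for p, t in zip(predictions, labels) if p != cls and t == cls)
        union = tp + fp + fn
        if union > 0:
            ious.append(tp / union)
    return sum(ious) / len(ious)


predictions = [0, 0, 1, 1, 2, 2]
labels = [0, 1, 1, 1, 2, 0]
score = mean_iou(predictions, labels, num_classes=3)
```

For a dataset-level score, the true positive, false positive and false negative counts would be accumulated over all test images before the ratios are taken.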

The absence of the decoder structure helps the ENet encoder be faster than full ENet but at the same time results in a lower level of accuracy. In RPNet, although residual predictors are added, the FLOPs and parameters are almost at the same level as the ENet encoder. Such result is thanks to the reduction of depth and increase of width of the network, which improves the performance in device parallel. The RPNet adds the details to segmentation and makes a great improvement on accuracy, where the higher residuals has greater contributions to the results.

From another aspect, decoder of original ENet has a limited increase of $0.29$ on mean IoU, while RPNet is at least 3.4 higher on ENet. The RPNet-er-gp even has a 4.78 improving on ENet. The RPNet has a distinct advantage on accuracy in the visualization of sample results in Fig.(5).

We also compare the different structures of residuals and predictors. Expanded residuals (er) extract more details from downsampling process. Though not increasing parameters or computation, it enhance the performance significantly. Base predictor (bp) of 1 $\times$ 1 convolution is proved to be effective for residuals as the combination of sr and bp has shown a performance boost. When replaced the bp with guided predictor (gp), the RPNet will be further improved. Finally, the best combination of RPNet is proved to be er and gp.

In parameters and FLOPs comparison, all RPNet has advantages with original encoder-decoder structure. At the same time, the improvement on mean IoU is also significant.

Then to study on the effect on different training methods of the RPNet, we compare the single training with different type of weighted losses mentioned in Impatient DNNs [1] and our level wise training on RPNet-sr-bp. The result is shown in Table.(IV).

In Table.(IV), the uniform weights (EQ), linearly increasing weights (LIN), polynomially increasing weights (POLY) and normal-distribution weights (NORM) described in [1] are used to train RPNet. However, as analysed in Section(III-C), when all levels are trained together from scratch, the mean IoU is worse than with top-down level-wise training, whichever type of weighted loss is applied.
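To make the four weighting schemes concrete, the sketch below builds one normalized weight per pyramid level and combines the per-level losses into a single training objective. The exact formulas are those of [1] and are not restated here; the functional forms, the exponent `power`, the width `sigma`, and the choice to centre the NORM weights on the last level are all illustrative assumptions.

```python
import numpy as np

def level_weights(num_levels, scheme="EQ", power=2.0, sigma=1.0):
    """Return one weight per level, normalized to sum to 1.

    The shapes below are assumed for illustration only; see [1] for the
    definitions actually used in the comparison.
    """
    i = np.arange(1, num_levels + 1, dtype=np.float64)
    if scheme == "EQ":        # uniform
        w = np.ones(num_levels)
    elif scheme == "LIN":     # linearly increasing
        w = i
    elif scheme == "POLY":    # polynomially increasing (exponent assumed)
        w = i ** power
    elif scheme == "NORM":    # Gaussian-shaped, centred on the last level (assumed)
        w = np.exp(-0.5 * ((i - num_levels) / sigma) ** 2)
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return w / w.sum()

def weighted_total_loss(level_losses, weights):
    """Single-stage objective: weighted sum of the per-level losses."""
    return float(np.dot(weights, np.asarray(level_losses, dtype=np.float64)))
```

Level-wise training, by contrast, optimizes one level at a time instead of minimizing such a joint sum.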

As shown in Fig.(5), the RPNet retains more details than the original ENet structures, especially for small and thin objects such as signs and traffic lights, which are recovered by the independent residual features of the different levels.

We also compare the speeds of ENet and RPNet-sr-bp on different platforms, as shown in Table.(II). The comparison indicates that RPNet is more efficient when computing resources are limited or the input is large. On the TX2, RPNet improves fps by $23\%$ on average, and on the 1080Ti the fps gain of RPNet ranges from $35\%$ to $84\%$ across resolutions from $640\times 360$ to $1920\times 1080$.
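The percentages above are relative fps gains. A small helper makes the arithmetic explicit; the function names and the sample numbers in the comment are hypothetical and are not taken from Table.(II).

```python
def fps_from_latency(seconds_per_frame):
    """Frames per second from the average per-frame latency."""
    return 1.0 / seconds_per_frame

def relative_gain(fps_baseline, fps_new):
    """Relative fps improvement of the new model over the baseline."""
    return (fps_new - fps_baseline) / fps_baseline

# Hypothetical example: a baseline at 10.0 fps and a new model at 12.3 fps
# correspond to a relative gain of about 0.23, i.e. 23%.
```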

Because the computation tied to network depth is reduced and the extra computation from network width is cheap, the RPNet structure makes better use of computing resources. Note that the implementation on the TX2 does not use any extra acceleration tool such as TensorRT, so a further-optimized RPNet can be deployed in embedded real-time systems.

IV-C Evaluation on CamVid dataset

The PASCAL VOC intersection-over-union (IoU) metric [10] is used to evaluate the methods on CamVid and Cityscapes. An additional semantic instance-level intersection-over-union (iIoU) metric is used on Cityscapes; it focuses on how well the individual instances in the scene are represented in the labeling. In this section, we use expanded residuals and the guided predictor as the default setting of RPNet.
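For reference, the two metrics can be written as follows, using the definitions of the PASCAL VOC and Cityscapes benchmarks. TP, FP and FN are the per-class counts of true-positive, false-positive and false-negative pixels. In iIoU, iTP and iFN are the same counts with each pixel weighted by the ratio of the class's average instance size to the size of the ground-truth instance it belongs to, so that small instances are not dominated by large ones; FP is left unweighted.

$$\mathrm{IoU} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \qquad \mathrm{iIoU} = \frac{\mathrm{iTP}}{\mathrm{iTP} + \mathrm{FP} + \mathrm{iFN}}$$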

We build RPNet on the ENet and ERFNet encoders to evaluate its performance. ENet has 87 $Conv$ layers and ERFNet has 75 $Conv$ layers at maximum depth. After adopting RPNet, the maximum depth drops to 71 and 55, respectively. The comparison of parameters and FLOPs is shown in Table.(V).

Table.(V) indicates that on small inputs with adequate computing resources, ERFNet, which has fewer layers at maximum depth but more FLOPs, is faster than ENet. On the larger Cityscapes inputs, shown in Table.(VI), ERFNet with its higher FLOPs is slower than ENet. After replacing the decoder with RPNet, the FPS increases greatly, especially for RPNet(ERFNet), whose speed even exceeds that of ENet. In addition, compared with the decoder versions of the methods, the RPNet also has an advantage in mean IoU.

IV-D Evaluation on Cityscapes dataset

The proposed RPNet achieves impressive results compared with the state-of-the-art methods. The iIoU of RPNet increases significantly compared with the baselines, which again indicates that residual prediction enhances details and small objects. The improved architecture of the segmentation network also increases efficiency.

Fig.(6) also shows the intermediate and final results of RPNet. Comparing columns (b) and (c) reveals the lost details that the RPNet recovers.
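The coarse-to-fine refinement that produces these intermediate and final results can be sketched in NumPy as below, in the spirit of a Laplacian pyramid [4]. This is only an illustration under stated assumptions: nearest-neighbour 2x upsampling and plain addition of logits are assumed, and `upsample2x` and `fuse_pyramid` are hypothetical names; the actual RPNet fusion may differ.

```python
import numpy as np

def upsample2x(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) array (assumed choice)."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

def fuse_pyramid(top_logits, residuals):
    """Refine a coarse prediction with residuals ordered from coarse to fine.

    top_logits: (C, H, W) class scores of the top-level prediction.
    residuals:  list of (C, 2H, 2W), (C, 4H, 4W), ... predicted residuals.
    """
    out = np.asarray(top_logits, dtype=np.float64)
    for res in residuals:
        out = upsample2x(out)
        if out.shape != res.shape:
            raise ValueError(f"residual shape {res.shape} != expected {out.shape}")
        out = out + res  # the residual restores detail lost at the coarser level
    return out

def predict_labels(logits):
    """Per-pixel class decision from (C, H, W) scores."""
    return logits.argmax(axis=0)
```

In this sketch, a residual that is zero everywhere leaves the upsampled top-level labels unchanged, while a strong residual at a few pixels flips only those pixels.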

On larger inputs, heavy networks show their advantage in learning more characteristics, and their performance gain on Cityscapes is greater than that of thin networks. ENet, ESPNet and RPNet (ENet), with FLOPs under 9B, have lower IoUs than the methods with higher FLOPs, and among them RPNet still performs better. BiSeNet, with its two-path structure, is as good as ERFNet in both accuracy and efficiency. The high speed of BiSeNet also comes from its shallow depth, since its backbone is the lightweight Xception39 [8]. Still, its larger inference input size limits further speed-up. ICNet, which has multi-resolution paths, also reaches a high mean IoU, but the multiple paths produce more computation, which makes ICNet slower.

LRR is much slower because of its heavy backbone network and two-part prediction at testing time, although its IoU is 1.8 higher than that of RPNet (ERFNet). RefineNet also achieves high accuracy but spends much computation on high-resolution feature reconstruction. Both LRR and RefineNet are superior in iIoU, however, as both use low-level features to refine the boundaries and details of the targets. DeepLab (MobileNet) and ESPNet have fewer FLOPs but are slower than methods of the same level. Their reduction in FLOPs is attributed to depth-wise separable convolution, but such operators are unable to make full use of computing resources.
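The FLOPs reduction from depth-wise separable convolution follows from a standard operation count, sketched below. The count assumes stride 1 and an output of the same spatial size as the input, and it counts multiply-accumulates (MACs) rather than FLOPs; the function names are illustrative.

```python
def conv_macs(h, w, c_in, c_out, k):
    """MACs of a standard k x k convolution on an h x w output map."""
    return h * w * c_in * c_out * k * k

def separable_conv_macs(h, w, c_in, c_out, k):
    """MACs of a depth-wise k x k convolution followed by a 1 x 1 point-wise one."""
    depthwise = h * w * c_in * k * k   # one k x k filter per input channel
    pointwise = h * w * c_in * c_out   # 1 x 1 convolution mixes the channels
    return depthwise + pointwise

def reduction_ratio(c_out, k):
    """separable / standard = 1 / c_out + 1 / k**2, independent of h, w and c_in."""
    return 1.0 / c_out + 1.0 / (k * k)
```

With 3 x 3 kernels and 128 output channels the ratio is roughly 0.12, so the operation count falls by about 8x. The count says nothing about how well the resulting small operators keep the hardware busy, which is the point made above.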

V Conclusion

In this paper, we have proposed a single-shot method for semantic segmentation based on the Residual Pyramid Network (RPNet), which constructs ResBlocks to learn residuals of the target at different levels. On this basis, we introduce variations of the residual structure and the predictor. With the residual learning blocks, the RPNet outperforms methods that rely on complicated feature reconstruction and well-designed decoder structures. At the same time, the single-shot structure makes the RPNet as fast as methods without a decoder. Compared with conventionally structured models, RPNet delivers better performance in both efficiency and accuracy. The proposed RPNet can be used to improve segmentation models with arbitrary network structures by removing the decoder and implementing single-shot segmentation.

In our experiments, the RPNet is trained step by step. Although test efficiency is improved, the training period is long. Future work will involve experiments on the learning policy for RPNet to improve the efficiency of the training phase.

Acknowledgment

This work was supported by the Natural Science Foundations of China (61727802 and BE2018126).

Bibliography

[1] M. Amthor, E. Rodner, and J. Denzler. Impatient DNNs - deep neural networks with dynamic time budgets. CoRR, abs/1610.02850, 2016.
[2] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(12):2481–2495, Dec 2017.
[3] G. J. Brostow, J. Fauqueur, and R. Cipolla. Semantic object classes in video: A high-definition ground truth database. Pattern Recognition Letters, 30:88–97, 2009.
[4] P. J. Burt and E. H. Adelson. The Laplacian pyramid as a compact image code. Readings in Computer Vision, 31(4):671–679, 1987.
[5] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, April 2018.
[6] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. CoRR, abs/1706.05587, 2017.
[7] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In ECCV, 2018.
[8] F. Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1251–1258, 2017.