MultiNet++: Multi-Stream Feature Aggregation and Geometric Loss Strategy for Multi-Task Learning

Sumanth Chennupati; Ganesh Sistu; Senthil Yogamani; Samir A Rawashdeh

arXiv:1904.08492·cs.CV·April 23, 2019

TL;DR

This paper introduces MultiNet++, a multi-stream multi-task learning framework that leverages sequential video frames and a geometric loss strategy to improve segmentation, depth, and motion estimation in autonomous driving.

Contribution

It presents a novel multi-stream network architecture and a geometric mean-based loss function for better multi-task learning on video sequences.

Findings

1. Outperforms existing multi-task learning methods on KITTI, Cityscapes, and SYNTHIA datasets.

2. Effectively utilizes preceding frames for improved feature representation.

3. The geometric mean loss enhances convergence across tasks.

Tables

Table 1: Summary of the automotive datasets used in our experiments.

Annotations	KITTI [12]	Cityscapes [6]	SYNTHIA [39]
Segmentation	✓	✓	✓
Depth	✓	✓	✓
Motion	✓	✓	×
# Train	200	2,975	888
# Validation	200	500	787
Type	Real	Real	Synthetic

Table 2: Comparative study of the parameters needed to construct 1-task (segmentation, depth or motion), 2-task (segmentation and depth; segmentation and motion) and 3-task (segmentation, depth and motion) models. The 2-task and 3-task models are compared for 1-frame and 2-frame inputs.

Method	KITTI & Cityscapes					SYNTHIA
Method	Encoder	Segmentation	Depth	Motion	Total	Encoder	Segmentation	Depth	Total
1-Task Segmentation, Depth or Motion
1-Task	23.58M	0.18M	-	-	23.77M	23.58M	0.14M	-	23.68M
1-Task	23.58M	-	3.88K	-	23.59M	23.58M	-	3.87K	23.59M
1-Task	23.58M	-	-	8.33K	23.60M	-	-	-	-
2-Task Segmentation and Depth
1-Frame	23.58M	0.18M	3.88K	-	23.77M	23.58M	95.34K	3.88K	23.69M
2-Frames	23.58M	0.26M	7.46K	-	23.86M	23.58M	0.14M	7.46K	23.74M
2-Task Segmentation and Motion
1-Frame	23.58M	0.18M	-	8.33K	23.78M	-	-	-	-
2-Frames	23.58M	0.26M	-	15.50K	23.86M	-	-	-	-
3-Task Segmentation, Depth and Motion
1-Frame	23.58M	0.18M	3.88K	8.33K	23.79M	-	-	-	-
2-Frames	23.58M	0.26M	7.46K	15.50K	23.87M	-	-	-	-

Table 3: Improvements in learning segmentation, depth estimation and motion detection as multiple tasks using equal weights, the proposed geometric loss strategy (GLS), and 2-stream feature aggregation with GLS (MultiNet++), compared with independent networks (1-Task) on the KITTI, Cityscapes and SYNTHIA datasets.

Method	KITTI			Cityscapes			SYNTHIA
Method	Segmentation	Depth	Motion	Segmentation	Depth	Motion	Segmentation	Depth
1-Task Segmentation, Depth or Motion
1-Task	81.74%	-	-	78.95%	-	-	84.08%	-
1-Task	-	75.91%	-	-	60.13%	-	-	73.19%
1-Task	-	-	98.49%	-	-	98.72%	-	-
2-Task Segmentation and Depth
Equal weights	74.30%	74.47%	-	73.76%	59.38%	-	63.45%	71.84%
GLS (ours)	81.50%	74.92%	-	79.14%	60.15%	-	86.87%	73.60%
MultiNet++	81.01%	73.95%	-	83.07%	60.15%	-	88.15%	78.39%
2-Task Segmentation and Motion
Equal weights	80.14%	-	97.88%	78.46%	-	98.25%	-	-
GLS (ours)	81.52%	-	97.93%	77.63%	-	98.83%	-	-
MultiNet++	81.75%	-	98.15%	78.86%	-	98.65%	-	-
3-Task Segmentation, Depth and Motion
Equal weights	77.14%	76.15%	97.83%	72.71%	60.97%	98.20%	-	-
GLS (ours)	82.20%	76.54%	97.92%	77.38%	61.56%	98.72%	-	-
MultiNet++	80.06%	73.94%	97.94%	82.36%	62.74%	98.21%	-	-

Table 4: Comparative study of the performance of 1-Task, equal weights, 3-task uncertainty [20], Dynamic Weight Average (DWA) [26] and the proposed geometric loss strategy (GLS) on the KITTI and Cityscapes datasets.

KITTI
Method	Segmentation	Depth	Motion
1-Task	81.74%	75.91%	98.49%
Equal weights	77.14%	76.15%	97.83%
Uncertainty [20]	78.93%	75.73%	98.00%
DWA [26]	80.05%	74.48%	97.78%
GLS (ours)	82.20%	76.54%	97.92%
Cityscapes
1-Task	78.95%	60.13%	98.72%
Equal weights	72.71%	60.97%	98.20%
Uncertainty [20]	77.32%	60.44%	98.63%
DWA [26]	78.05%	59.34%	98.45%
GLS (ours)	77.38%	61.56%	98.72%

Equations

$\mathcal{L}_{Total} = \prod_{i=1}^{n} \sqrt[n]{\mathcal{L}_{i}}$ (1)

$\mathcal{L}_{Total} = \sqrt[3]{\mathcal{L}_{1}\mathcal{L}_{2}\mathcal{L}_{3}}$ (2)

$\mathcal{L}_{Total} = \prod_{i=1}^{n} \sqrt[n]{\mathcal{L}_{i}} \times \prod_{j=1}^{m} \sqrt[m]{\mathcal{L}_{j}}$ (3)

$\mathcal{L}_{Seg}\ \text{or}\ \mathcal{L}_{Motion} = -\sum_{j=1}^{N}\sum_{i=1}^{C} y_{i,j}\log(p_{i,j})$ (4)

$\mathcal{L}_{Depth}=\left\{\begin{array}{ll}\frac{1}{2}\left[y-\hat{y}\right]^{2} & : |y-\hat{y}|\leq\delta\\ \delta\left(|y-\hat{y}|-\delta/2\right) & : \text{otherwise}\end{array}\right.$ (5)

$\mathcal{L}_{Total} = \sqrt[3]{\mathcal{L}_{Seg}\mathcal{L}_{Depth}\mathcal{L}_{Motion}}$ (6)

Full text

MultiNet++: Multi-Stream Feature Aggregation and Geometric Loss Strategy for Multi-Task Learning

Sumanth Chennupati (1,3), Ganesh Sistu (2), Senthil Yogamani (2) and Samir A Rawashdeh (3)

(1) Valeo North America, (2) Valeo Vision Systems, (3) University of Michigan-Dearborn

Abstract

Multi-task learning is commonly used in autonomous driving for solving various visual perception tasks. It offers significant benefits in terms of both performance and computational complexity. Current work on multi-task learning networks focuses on processing a single input image, and there is no known implementation of multi-task learning handling a sequence of images. In this work, we propose a multi-stream multi-task network to take advantage of using feature representations from preceding frames in a video sequence for joint learning of segmentation, depth, and motion. The weights of the current and previous encoder are shared so that features computed in the previous frame can be leveraged without additional computation. In addition, we propose to use the geometric mean of task losses as a better alternative to the weighted average of task losses. The proposed loss function facilitates better handling of the difference in convergence rates of different tasks. Experimental results on KITTI, Cityscapes and SYNTHIA datasets demonstrate that the proposed strategies outperform various existing multi-task learning solutions.

1 Introduction

Multi-task learning (MTL) [2] aims to jointly solve multiple tasks by leveraging the underlying similarities between independent or interdependent tasks. It is perceived as an attempt to improve generalization by learning a common feature representation for multiple tasks. Improved prediction accuracy and reduced computational complexity are significant benefits of MTL. These benefits have led to the deployment of MTL in various applications in computer vision (especially scene understanding) [55, 22, 4], natural language processing [43, 11], speech recognition [57, 50], reinforcement learning [9, 8], drug discovery [34, 25], etc.

MTL networks have mainly been built using Convolutional Neural Networks (CNNs). These networks are usually limited to operating on a single stream of input data. However, numerous works demonstrate that using multiple streams of data as input to CNNs can improve performance drastically compared to using a single stream. Recent attempts that use consecutive frames in a video sequence for semantic segmentation [46, 51, 48], activity recognition [19, 49], optical flow estimation [35] and moving object detection [47, 56] are examples demonstrating the benefits of using multiple streams of input data. Similarly, a pair of images from stereo vision cameras [28] or multiple images from different cameras of a surround view system of a car can also be processed as multiple streams of input to CNNs. Some works considered processing input data from different domains [41] to solve certain tasks that require multi-modal data representations.

These significant benefits demand the construction of a multi-task learning network that can operate on multiple streams of input data. Thus, we propose MultiNet++, a novel multi-task network using simple feature aggregation methods as shown in Figure 1 to combine multiple streams of input data, which can be further processed by task-specific decoders. Figure 1 illustrates a generic way to aggregate features temporally and we make use of a simple summation junction to combine temporal features in our experiments. MultiNet++ would be ideal to process video sequences for tasks like semantic segmentation, depth estimation, optical flow estimation, object detection and tracking, etc. with improved efficiency. We also propose a novel loss strategy for multi-task learning based on geometric mean representation to prioritize learning of all tasks equally. The motivation for MultiNet++ is derived from our position paper NeurAll [52] which proposes to move towards a unified visual perception model for autonomous driving. We propose to use three diverse tasks namely segmentation, depth estimation and motion segmentation which make use of appearance, geometry and motion cues respectively.

The rest of this paper is structured as follows. Section 2 reviews related work using feature aggregation for multiple streams of inputs to CNNs and different task loss weighting strategies used in MTL. Section 3 discusses in detail the proposed MultiNet++ network along with the geometric loss strategy used in this paper. Section 4 presents the experimental results on automotive datasets, namely KITTI [12], Cityscapes [6] and SYNTHIA [39]. Finally, Section 5 summarizes the paper with key observations and concluding remarks.

2 Related Work

2.1 Multi-Task Learning

A multi-task learning network typically consists of two blocks: shared parameters and task-specific parameters. Shared parameters are learned to represent commonalities between several tasks, while task-specific parameters are learned to perform independent processing. In MTL networks built using CNNs, shared parameters are called encoders as they perform the key feature extraction, and the task-specific parameters are called decoders as they decode the information from encoders. MTL networks are classified into hard parameter sharing or soft parameter sharing categories based on how they share their parameters. In hard parameter sharing, initial layers or parameters are shared between different tasks such that these parameters are common for all tasks. In soft parameter sharing, different tasks are allowed to have different initial layers with some extent of sharing between them. Cross stitch [31] and sluice networks [40] are examples of soft parameter sharing. The majority of works in MTL use hard parameter sharing as it is easier to build and computationally less complex.

The performance of an MTL network is highly dependent on its shared parameters as they contain the knowledge learned from different tasks [2, 1, 38]. Inappropriate learning of these parameters can induce biased representations for a particular task, which can hurt the performance of MTL networks. This phenomenon is referred to as negative transfer learning. In order to prevent it, meaningful feature representations and balanced learning methods are required.

2.2 Feature Aggregation

Different outputs from initial or mid-level convolution layers from CNNs (referred to as extracted features) are forwarded to the next stage of processing using feature aggregation. Feature aggregation is a meaningful way to combine these extracted features. These features can be extracted from different CNNs operating on different input data [62, 37] or from a CNN operating on different resolutions of input [24]. Ranjan et al. [36] combine intermediate outputs from a CNN and pass them to the next stages of processing. Yu et al. [60] proposed several possibilities of feature aggregation.

There are plenty of choices to perform feature aggregation. These choices range from using simple concatenation techniques to complex Long Short Term Memory units (LSTMs) [17] or recurrent units. Simple concatenation or addition layers can capture short term temporal cues from a video sequence. Sun et al. [54] combine spatial and temporal features from video sequences for human activity recognition and Karpathy et al. [19] combine features from inputs separated by 15 frames in a video for classification. Hei Ng et al. [32] proposed several convolution and pooling operations to combine features for video classification while Sistu et al. [51] used simple 1×1 bottleneck convolutions to combine features from consecutive frames for video segmentation.
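The simpler aggregation choices listed above can be sketched for a single pixel location. This is an illustrative toy, not code from any of the cited works: the function names, the two-channel feature vectors and the weight matrix are all assumptions, and the 1×1 bottleneck is assumed here to act on the concatenated channels of the two frames.

```python
# Toy feature vectors (channels at one pixel) for the current and previous frame.
# All names and values here are hypothetical, chosen only for illustration.

def aggregate_concat(current, previous):
    """Concatenation: channels of both streams are stacked, doubling the width."""
    return current + previous


def aggregate_add(current, previous):
    """Addition: element-wise sum, keeps the channel count unchanged."""
    return [c + p for c, p in zip(current, previous)]


def aggregate_bottleneck_1x1(current, previous, weights):
    """1x1 convolution at one pixel: a linear map over the concatenated channels."""
    stacked = current + previous
    return [sum(w * x for w, x in zip(row, stacked)) for row in weights]


cur = [1.0, 2.0]
prev = [0.5, 1.5]
# Hypothetical weights that average matching channels of the two streams.
w = [[0.5, 0.0, 0.5, 0.0],
     [0.0, 0.5, 0.0, 0.5]]

concat_out = aggregate_concat(cur, prev)                 # 4 channels
add_out = aggregate_add(cur, prev)                       # 2 channels
bottleneck_out = aggregate_bottleneck_1x1(cur, prev, w)  # 2 channels
```

Concatenation leaves the mixing of streams to the decoder at the cost of wider inputs, while addition and the bottleneck keep the decoder input width fixed.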

In automotive or indoor robotic visual perception problems, simple concatenation techniques perform well but they fall short in some applications like video captioning [10, 33] or summarization [42] where long term dependencies are required. LSTMs in such cases offer a better alternative [59, 45]. Convolution-LSTMs (Conv-LSTMs) [58, 53] and 3D convolutions [18] are other options. However, these options incur additional computational complexity and they are needed mainly for aggregation of features that are significant for long term dependencies.

2.3 Multi-Task Loss

With the growing popularity of MTL, it is worth considering the possibility of imbalances in training an MTL network. It is often observed that some tasks dominate others during the training phase [14]. This dominance can be attributed to variations in task heuristics like complexities, uncertainties, and magnitudes of losses etc. Therefore an appropriate loss or prioritization strategy for all tasks in an MTL is a necessity.

Early works in MTL [55, 22] use a weighted arithmetic sum of individual task losses. Later, several works attempted to balance the task weights using certain task heuristics discussed earlier. Kendall et al. [20] proposed to use homoscedastic uncertainty of tasks to weigh them. This approach requires explicit modeling of uncertainty and more importantly, the task weights remain constant.

GradNorm [3] is another notable work, in which Chen et al. propose to normalize gradients from all tasks to a common scale during backpropagation. Liu et al. [26] proposed Dynamic Weight Average (DWA), which uses an average of task losses over time to weigh the task losses. Guo et al. [14], on the other hand, proposed dynamic task prioritization, where changes in the difficulty of tasks adjust the task weights. This allows focusing on harder problems first and then on less challenging tasks. Liu et al. devised a different strategy that uses a reinforcement learning based approach to learn optimal task weights. However, this method isn't simple and it brings additional complexity to the training phase.
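For readers unfamiliar with DWA, the weighting rule can be sketched as follows. The formula is not given in this paper; the sketch follows the rule as it is commonly stated for DWA (a softmax over the ratio of each task's two most recent losses, scaled so the weights sum to the number of tasks). The function name, argument names and the default temperature of 2 are assumptions made for illustration.

```python
import math


def dwa_weights(prev_losses, prev_prev_losses, temperature=2.0):
    """Sketch of Dynamic Weight Average task weights (assumed form).

    prev_losses:      per-task loss at step t-1
    prev_prev_losses: per-task loss at step t-2
    """
    num_tasks = len(prev_losses)
    # Relative descent rate: a ratio near 1 means the task is improving slowly.
    ratios = [a / b for a, b in zip(prev_losses, prev_prev_losses)]
    exps = [math.exp(r / temperature) for r in ratios]
    total = sum(exps)
    # Softmax scaled so that the weights sum to the number of tasks.
    return [num_tasks * e / total for e in exps]


# Task 0 dropped from 1.0 to 0.8, task 1 from 1.0 to 0.5.
weights = dwa_weights([0.8, 0.5], [1.0, 1.0])
```

In this toy call the slower-improving task receives the larger weight, and the weights depend on the loss history, which has to be tracked during training.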

In contrast to modeling the multi-task problem as a single-objective problem, Sener and Koltun [44] proposed to model it as a multi-objective optimization problem. Zhang and Yeung [61] proposed a convex formulation for multi-task learning and Désidéri [7] proposed a multiple-gradient descent algorithm. In summary, these strategies either involve an explicit definition of the loss function using task heuristics or require complex optimization techniques. Therefore, a loss strategy with minimal design complexity is well suited for multi-task learning to accommodate a virtually unlimited number of joint tasks.

3 Proposed Solution

We introduce our novel multi-task network MultiNet++, that is capable of processing multiple streams of input data. The proposed architecture is scalable and can be readily applied in any multi-task problem. In the following subsection, we discuss how we built our MultiNet++ network shown in Figure 2.

3.1 Multi-stream Multi-task Architecture

MultiNet++ is a simple multi-task network with the ability to process multiple streams of input data. It is built using three main components, 1) Encoders that feed multiple streams of input into the network, 2) Feature aggregation layers that concatenate the encoded feature vectors from multiple streams and 3) Task-specific decoders that operate on aggregated feature space to perform task-specific operations. In this paper, we use MultiNet++ for joint semantic segmentation, depth estimation and moving object detection (or simply motion) on video sequences. We share the encoder between two consecutive frames from a given video sequence as shown in Figure 2. This can significantly reduce the computational load as the encoders require a daunting number of parameters. These input frames can be selected sparsely or densely from a video sequence by observing its motion histogram. One can also choose to pass keyframes as proposed by Kulhare et al. [23].

Our encoders are obtained by removing the fully connected layers from ResNet-50 [16]. Outputs from the ReLU [15] activations at layers 23, 39 and 46 of the ResNet-50 [16] encoder are extracted and sent to the feature aggregation layers. These feature maps extracted from different streams of inputs are concatenated and sent to task-specific decoders as shown in Figure 1. The segmentation decoder is built using the FCN8 [27] architecture, which comprises 3 upsampling layers and skip connections from aggregated feature maps as shown in Figure 2. The final layer consists of softmax [13] units to predict pixel-wise classification labels. Similarly, we construct a motion decoder by changing the number of output classes in the softmax units. The depth decoder is built by replacing the softmax units with regression units.
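The saving from sharing the encoder between consecutive frames can be illustrated with a small sketch in which each frame is encoded exactly once and the previous frame's features are reused. This is a toy under stated assumptions, not the Keras implementation: `TwoStreamAggregator`, `toy_encoder` and the handling of the very first frame are hypothetical, features are plain lists, and the streams are combined with the summation junction mentioned in the Introduction (concatenation would be an equally valid choice).

```python
from typing import Callable, List, Optional

Features = List[float]  # stand-in for a flattened feature map


class TwoStreamAggregator:
    """Encode each frame once and reuse the cached features of the previous frame."""

    def __init__(self, encoder: Callable[[List[float]], Features]):
        self.encoder = encoder  # one encoder, i.e. shared weights for both streams
        self.previous: Optional[Features] = None
        self.encoder_calls = 0

    def step(self, frame: List[float]) -> Features:
        current = self.encoder(frame)
        self.encoder_calls += 1
        # Assumption: the first frame has no predecessor, so it is paired with itself.
        previous = self.previous if self.previous is not None else current
        self.previous = current  # cache for the next time step
        # Summation junction: element-wise sum of the two streams.
        return [c + p for c, p in zip(current, previous)]


def toy_encoder(frame: List[float]) -> Features:
    """Hypothetical encoder: doubles every input value."""
    return [2.0 * x for x in frame]


aggregator = TwoStreamAggregator(toy_encoder)
out_first = aggregator.step([1.0, 2.0])   # [2, 4] + [2, 4]
out_second = aggregator.step([3.0, 4.0])  # [6, 8] + [2, 4]
```

Two frames cost two encoder calls in total, so the 2-frame model adds no encoder computation per time step compared with a 1-frame model; the aggregated features would then be passed to the task-specific decoders.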

3.2 Geometric Loss Strategy

We discussed the importance of a loss strategy that requires minimal effort during the design phase in Section 2.3. The commonly used loss combination function is the arithmetic mean, and it suffers from differences in the scale of the individual losses. This is partially alleviated by a weighted average of the losses, but it is difficult to tune manually. We were motivated to explore a geometric loss combination, which is invariant to the scale of the individual losses. Thus we express the total loss of a multi-task learning problem as the geometric mean of individual task losses. We refer to this as Geometric Loss Strategy (GLS). For an n-task problem with task losses $\mathcal{L}_{1}, \mathcal{L}_{2}, \dots, \mathcal{L}_{n}$, we express the total loss as:

$\mathcal{L}_{Total} = \prod_{i=1}^{n} \sqrt[n]{\mathcal{L}_{i}}$ (1)

For example, in a 3-task problem with losses $\mathcal{L}_{1}$, $\mathcal{L}_{2}$ and $\mathcal{L}_{3}$, we express the total loss as:

$\mathcal{L}_{Total} = \sqrt[3]{\mathcal{L}_{1}\mathcal{L}_{2}\mathcal{L}_{3}}$ (2)

Equations 1 and 2 are quite popular in geometric programming. This loss function is differentiable and can be optimized using an optimizer like Stochastic Gradient Descent (SGD). In fact, this definition makes sure that all tasks are making progress. We adapt our loss function to give more attention to certain tasks by introducing Focused Loss Strategy (FLS), where we multiply the existing loss function by the geometric mean of the losses of the focused tasks. In this case, we define the loss function with focus on $m$ $(m\leq n)$ important tasks as:

$\mathcal{L}_{Total} = \prod_{i=1}^{n} \sqrt[n]{\mathcal{L}_{i}} \times \prod_{j=1}^{m} \sqrt[m]{\mathcal{L}_{j}}$ (3)

Equation 3 provides an opportunity to focus on important tasks in a multi-task learning problem. Here we assume that the tasks are ordered in terms of priority so that the first $m$ tasks out of the total $n$ tasks get higher weight.
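A minimal sketch of GLS (Equation 1) and FLS (Equation 3) on plain Python floats is given below. It is an illustration under assumptions rather than the training code: the function names are hypothetical, the losses are assumed strictly positive, and in a real network the same expressions would be written with the framework's differentiable tensor operations.

```python
import math


def geometric_loss(losses):
    """GLS: geometric mean of n strictly positive task losses (Equation 1)."""
    n = len(losses)
    return math.prod(losses) ** (1.0 / n)


def focused_loss(losses, m):
    """FLS: GLS over all n tasks times GLS over the first m prioritized tasks (Equation 3)."""
    return geometric_loss(losses) * geometric_loss(losses[:m])


base = [4.0, 1.0, 2.0]
gls = geometric_loss(base)        # cube root of 8
fls = focused_loss(base, m=1)     # GLS times the first task's loss

# Scale behaviour: rescaling any single task by 1000 changes the total by the
# same factor (1000 ** (1/3)), whichever task is rescaled.
ratio_task0 = geometric_loss([4000.0, 1.0, 2.0]) / gls
ratio_task1 = geometric_loss([4.0, 1000.0, 2.0]) / gls
```

The last two lines show the sense in which the combination is insensitive to loss scale: a constant factor on one task only rescales the total and does not change the relative influence of the tasks.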

Applying the $\log$ function converts the product of losses into a sum of the $\log$ of the individual losses, which can be interpreted as normalizing the individual losses and then adding them. However, it is computationally complex to make use of the $\log$ function.
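The normalization interpretation can be made explicit with a short derivation. It is added here as a sketch and is not taken from the paper; it assumes every task loss is strictly positive and introduces $\theta$ as a symbol for the shared parameters.

```latex
\log \mathcal{L}_{Total} = \frac{1}{n}\sum_{i=1}^{n}\log \mathcal{L}_{i}
\qquad\Longrightarrow\qquad
\frac{\partial \mathcal{L}_{Total}}{\partial \theta}
  = \frac{\mathcal{L}_{Total}}{n}\sum_{i=1}^{n}
    \frac{1}{\mathcal{L}_{i}}\,\frac{\partial \mathcal{L}_{i}}{\partial \theta}
```

Each task's gradient is divided by that task's own loss, so a task whose loss has a large magnitude does not automatically dominate the update of the shared parameters.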

4 Experiments and Results

In this section, we discuss the datasets used for evaluating the efficacy of the proposed models. Later, we discuss in detail how we constructed the proposed models and provide a complexity analysis of each. We also discuss the optimization strategies used during the training phase. Finally, we provide the results obtained along with a discussion.

4.1 Datasets

KITTI [12], Cityscapes [6] and SYNTHIA [39] are popular automotive datasets. KITTI has annotations for several tasks including semantic segmentation, depth estimation, object detection, etc. However, these annotations were done separately for each task and the input is not always common across the tasks. KITTI Stereo 2015 [30, 29] dataset provides stereo images for depth estimation. A subset of these images is labeled for KITTI semantic segmentation [12]. This dataset consists of 200 train images and 200 test images. Cityscapes [6] dataset provides both segmentation and depth estimation annotations for $\approx$ 3500 images. Motion labels for these datasets are provided by Vertens et al. [56]. SYNTHIA [39] is a synthetic dataset that provides segmentation and depth annotations for raw video sequences simulated in different weather, light conditions and road types. KITTI [12] and Cityscapes [6] provide segmentation labels for 20 categories while SYNTHIA [39] dataset provides segmentation labels for 13 categories.

In KITTI [12] and Cityscapes [6] datasets, images are sampled and annotated sparsely from raw videos. This poses a challenge to approaches that use temporal methods for segmentation or motion detection tasks in videos. In addition to KITTI [12] and Cityscapes [6] datasets, we use SEQS-02 (New York-like city) and SEQS-05 (New York-like city) from SYNTHIA dataset for training and validation respectively in our experiments. These sequences provide segmentation and depth annotations for consecutive images in a video sequence. Thus they are more suitable for evaluating our multi-task model which operates on multiple streams of input data. Table 1 provides a summary of different properties of the 3 datasets discussed so far.

4.2 Model Analysis

We constructed several models to evaluate the benefits of the proposed MultiNet++. We build 3 single task baseline models for segmentation, depth and motion tasks using ResNet-50 [16] as an encoder and different task-specific decoders as discussed in Section 3.1. The segmentation decoder predicts pixel-wise labels from 20 different categories for input in the KITTI [12] and Cityscapes [6] datasets, while it predicts from 13 categories in the SYNTHIA [39] dataset. The depth decoder outputs a 16-bit integer at every pixel location to predict depth, and the motion decoder predicts a binary classification label for every pixel to classify it as a moving or static object. These models process one frame of input data. We also constructed 2-task and 3-task models that operate on a single frame and on 2 consecutive frames of an input video sequence. MultiNet++ refers to models that operate on 2 consecutive frames, which are built using feature aggregation as discussed in Section 3.1. Table 2 provides details about the number of parameters required to construct the different models.

The majority of the computational load arises from the ResNet-50 [16] encoder. As a result, the 2-task and 3-task models require almost the same number of parameters as a 1-task model. This is one of the main reasons why multi-task networks are computationally efficient and favor embedded deployment. We build our 2-frame models with very little increase in complexity ( $\approx$ 100K parameters) by reusing the encoder between the 2 frames. In the 2-frame model, the aggregated features are larger than in the 1-frame model, which accounts for the increase in parameters.
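The size of this increase can be checked from the decoder columns of Table 2 for the KITTI and Cityscapes 3-task models. The short calculation below uses the rounded values printed in the table, so the result is approximate; the variable names are illustrative only.

```python
K, M = 1e3, 1e6

# Decoder parameter counts from Table 2 (KITTI & Cityscapes, 3-task rows).
one_frame = {"segmentation": 0.18 * M, "depth": 3.88 * K, "motion": 8.33 * K}
two_frames = {"segmentation": 0.26 * M, "depth": 7.46 * K, "motion": 15.50 * K}
encoder = 23.58 * M  # shared ResNet-50 encoder, identical in both settings

# Extra decoder parameters needed when aggregating 2 frames instead of 1.
extra = {task: two_frames[task] - one_frame[task] for task in one_frame}
total_extra = sum(extra.values())         # about 90.75K parameters
relative_extra = total_extra / encoder    # fraction of the encoder size
```

With these rounded figures the 2-frame model adds roughly 91K decoder parameters, in line with the ≈100K quoted above and well under 1% of the encoder size.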

4.3 Optimization

We implemented our proposed models using Keras [5]. In all our experiments, we re-size the input images to 224×384. We used only 2 frames for feature aggregation because adding more frames would increase computational complexity with insignificant performance gains, as demonstrated by Sistu et al. [51]. In our multi-task learning networks, we define the loss functions for each task separately and feed them to our geometric loss strategy (GLS) proposed in Section 3.2. For semantic segmentation and motion, we use pixel-wise cross-entropy loss for $C$ classes averaged over a mini-batch with $N$ samples as shown in Equation 4.

$\mathcal{L}_{Seg}\ \text{or}\ \mathcal{L}_{Motion} = -\sum_{j=1}^{N}\sum_{i=1}^{C} y_{i,j}\log(p_{i,j})$ (4)
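The cross-entropy loss of Equation 4 can be sketched in plain Python as follows. The equation as recovered here is a plain double sum, while the text describes averaging over the mini-batch, so the sketch exposes that choice as an `average` flag. The function name, the flag and the `eps` clamp that guards against `log(0)` are assumptions for illustration.

```python
import math


def cross_entropy(y_true, y_prob, average=True, eps=1e-12):
    """Cross-entropy over N samples and C classes.

    y_true[j][i]: one-hot label of class i for sample j
    y_prob[j][i]: predicted probability of class i for sample j
    """
    total = 0.0
    for labels, probs in zip(y_true, y_prob):
        for y, p in zip(labels, probs):
            total -= y * math.log(max(p, eps))  # clamp avoids log(0)
    return total / len(y_true) if average else total


labels = [[1, 0], [0, 1]]
probs = [[0.5, 0.5], [0.25, 0.75]]
loss_sum = cross_entropy(labels, probs, average=False)
loss_mean = cross_entropy(labels, probs, average=True)
```

For the motion task the same function applies with $C = 2$ (moving or static).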

For depth estimation, we use Huber loss as defined in Equation 5 with $\delta$ =250.

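$\mathcal{L}_{Depth}=\left\{\begin{array}{ll}\frac{1}{2}\left[y-\hat{y}\right]^{2} & : |y-\hat{y}|\leq\delta\\ \delta\left(|y-\hat{y}|-\delta/2\right) & : \text{otherwise}\end{array}\right.$ (5)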
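A per-pixel sketch of the Huber loss in Equation 5 with $\delta = 250$ is shown below; the function name and the sample values are illustrative assumptions.

```python
def huber_loss(y, y_hat, delta=250.0):
    """Huber loss for one depth value: quadratic near zero, linear for large errors."""
    residual = abs(y - y_hat)
    if residual <= delta:
        return 0.5 * residual ** 2
    return delta * (residual - delta / 2.0)


small_error = huber_loss(100.0, 0.0)    # quadratic branch
large_error = huber_loss(1000.0, 0.0)   # linear branch
at_threshold = huber_loss(250.0, 0.0)   # both branches agree here
```

The two branches meet at $|y-\hat{y}| = \delta$, so the loss is continuous, and large depth errors grow linearly rather than quadratically.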

The total loss $\mathcal{L}_{Total}$ is defined as:

$\mathcal{L}_{Total} = \sqrt[3]{\mathcal{L}_{Seg}\mathcal{L}_{Depth}\mathcal{L}_{Motion}}$ (6)

We optimize this loss function in our training phase using the Adam optimizer [21]. Accuracy is used as an evaluation metric for segmentation and motion tasks, while regression accuracy is used for depth estimation.
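For the segmentation and motion tasks, accuracy can be read as the fraction of pixels whose predicted label matches the ground truth. The sketch below assumes that reading; the function name and the tiny 2×2 label maps are hypothetical. The exact definition of regression accuracy for depth is not given in the text, so it is not sketched here.

```python
def pixel_accuracy(predicted, target):
    """Fraction of pixels whose predicted class label equals the ground-truth label."""
    pairs = [
        (p, t)
        for pred_row, target_row in zip(predicted, target)
        for p, t in zip(pred_row, target_row)
    ]
    correct = sum(1 for p, t in pairs if p == t)
    return correct / len(pairs)


pred_labels = [[0, 1], [1, 1]]
true_labels = [[0, 1], [0, 1]]
accuracy = pixel_accuracy(pred_labels, true_labels)  # 3 of 4 pixels match
```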

4.4 Results

In Table 3, we compare the results of 2-task models and 3-task models using our geometric loss strategy (GLS) against the naive equal task weight method. We also compare their performances with 1-task segmentation, depth and motion models. Our GLS method shows significant improvements in performance over the equal weights method in both 2-task and 3-task models. In Table 4, we compare the results of 3-task models using our geometric loss strategy (GLS) against naive equal task weights, the uncertainty weight method proposed by Kendall et al. [20] and Dynamic Weight Average (DWA) proposed by Liu et al. [26]. In Figure 4 (4(a), 4(b), 4(c), 4(d), 4(e) and 4(f)), we show how the validation loss for these models changes over time during the training phase. Our models using GLS demonstrated faster convergence on all tasks.

In 3-task models solving for segmentation, depth, and motion, depth is usually the most complex task. Figures 4(b) and 4(e) show that depth estimation on KITTI [12] and Cityscapes [6] requires longer convergence time compared to segmentation (Figures 4(a) and 4(d)) and motion tasks (Figures 4(c) and 4(f)). In these cases, our GLS method has shown faster convergence compared to uncertainty [20] and DWA [26] methods. While solving for multiple tasks, uncertainty [20] and DWA [26] weigh the tasks that converge quickly higher than the others. This led to faster convergence in segmentation and motion tasks but late convergence in depth task. In such circumstances, the encoder parameters might be biased towards segmentation and motion tasks. This can result in imbalanced learning of depth task. Our GLS method expresses the total loss as the geometric mean of individual losses, so it doesn’t prioritize one task higher than others. In this way, we achieve balanced training and improved performances compared to other techniques.

In Table 3, we also compare 2-task and 3-task models with our novel MultiNet++, which uses both feature aggregation (for 2-frame input) and GLS. In the KITTI [12] dataset, input images are sparsely sampled from raw video sequences, which hinders the performance gains of MultiNet++. In the Cityscapes [6] dataset, MultiNet++ outperforms single task models by 4% and 3% for segmentation and depth tasks respectively, as its images are sampled more closely than in the KITTI dataset. These improvements are larger in the SYNTHIA [39] dataset (4% and 5% for segmentation and depth estimation tasks respectively), as it provides continuous frames of video sequences. We achieve similar performances for the motion task compared to 1-task models.
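The quoted gains can be reproduced from Table 3 as differences in percentage points. The choice of rows is an assumption made here: for each dataset and task, the best MultiNet++ entry is compared with the 1-Task entry (the 2-task segmentation and depth row for Cityscapes segmentation and for SYNTHIA, the 3-task row for Cityscapes depth).

```python
# Accuracy values (%) copied from Table 3.
one_task = {
    ("Cityscapes", "segmentation"): 78.95,
    ("Cityscapes", "depth"): 60.13,
    ("SYNTHIA", "segmentation"): 84.08,
    ("SYNTHIA", "depth"): 73.19,
}
multinet_pp = {
    ("Cityscapes", "segmentation"): 83.07,  # 2-task segmentation and depth
    ("Cityscapes", "depth"): 62.74,         # 3-task
    ("SYNTHIA", "segmentation"): 88.15,     # 2-task segmentation and depth
    ("SYNTHIA", "depth"): 78.39,            # 2-task segmentation and depth
}

# Gain in percentage points of MultiNet++ over the single-task baseline.
gains = {key: round(multinet_pp[key] - one_task[key], 2) for key in one_task}
```

The differences come to about 4.1 and 2.6 points on Cityscapes and about 4.1 and 5.2 points on SYNTHIA, which round to the figures quoted in the text.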

We compare qualitative results of MultiNet++ with 1-task segmentation model on Cityscapes [6] dataset in Figure 3. The main difference between 1-task models and 3-task models is that the latter have learned representations from other tasks using a common encoder. Knowledge acquired through these representations helps 3-task model to identify semantic boundaries better compared to 1-task model. It is clearly evident that MultiNet++ model has improved performance. Our models detect traffic signs, lights and other near range objects better compared to other models on KITTI dataset [12] as shown in Figure 5.

5 Conclusion

We introduced an efficient way of constructing MultiNet++, a multi-task learning network that operates on multiple streams of input data. We demonstrated that our geometric loss strategy (GLS) is robust to different task heuristics like complexity, magnitude, etc. We achieved balanced training and improved performances for a multi-task learning network solving different tasks namely segmentation, depth estimation and motion on automotive datasets KITTI, Cityscapes, and SYNTHIA. Our GLS strategy is easy to implement and most importantly it allows for balanced learning of a large number of tasks in multi-task learning without requiring explicit loss modeling when compared to other multi-task learning loss strategies. In the future, we would like to explore the benefits of multi-task learning networks using our efficient feature aggregation and loss strategies for multi-modal data.

Acknowledgements

Authors would like to thank their employer for supporting fundamental research. Authors would also like to thank Dr. Aditya Viswanathan and Dr. Thibault Julliand for helpful discussions.

Bibliography

[1] H. Bilen and A. Vedaldi. Universal representations: The missing link between faces, text, planktons, and cat breeds. arXiv preprint arXiv:1701.07275, 2017.
[2] R. Caruana. Multitask learning. Machine Learning, 28(1):41–75, Jul 1997.
[3] Z. Chen, V. Badrinarayanan, C.-Y. Lee, and A. Rabinovich. Gradnorm: Gradient normalization for adaptive loss balancing in deep multitask networks. In ICML, 2018.
[4] S. Chennupati, G. Sistu, S. Yogamani, and S. Rawashdeh. Auxnet: Auxiliary tasks enhanced semantic segmentation for automated driving. In Proceedings of the 14th International Joint Conference on Computer Vision, Imaging and Computer Graphics Theory and Applications - Volume 5: VISAPP, pages 645–652. INSTICC, SciTePress, 2019.
[5] F. Chollet et al. Keras. https://keras.io, 2015.
[6] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
[7] J.-A. Désidéri. Multiple-gradient descent algorithm (mgda) for multiobjective optimization. Comptes Rendus Mathematique, 350(5-6):313–318, 2012.
[8] C. Devin, A. Gupta, T. Darrell, P. Abbeel, and S. Levine. Learning modular neural network policies for multi-task and multi-robot transfer. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2169–2176, May 2017.