Learning Task Relatedness in Multi-Task Learning for Images in Context

Gjorgji Strezoski; Nanne van Noord; Marcel Worring

arXiv:1904.03011·cs.CV·April 8, 2019

TL;DR

This paper introduces Selective Sharing, a method that learns task relatedness automatically during multi-task learning for images, improving accuracy and efficiency across various datasets without relying on explicit domain knowledge.

Contribution

The paper proposes a novel approach to learn inter-task relatedness dynamically, enabling automatic task grouping and knowledge sharing in multi-task learning without prior domain knowledge.

Findings

01

Consistent accuracy improvements over baselines and state-of-the-art methods.

02

Reduces parameter counts while maintaining performance.

03

Provides insights into learned representations through activation analysis.

Tables

Table 3. Performance comparison for the OmniArt dataset between a baseline STL approach, a baseline MTL approach and Selective Sharing with Similarity. For every attempt we list the final number of trainable parameters, which for the STL approach is the sum of all the trainable parameters for all tasks. Since the feature extraction is fixed, the parameter count contains only the task specific branches and the hard shared layer.

Method	AA (acc. %)	GC (acc. %)	CCC (acc. %)	SC (acc. %)	MC (acc. %)	STC (acc. %)	CYE (MAE, years)	# Params	Lock Epoch	# Branches
STL Baseline	28.0	25.0	32.1	24.7	41.3	19.2	144.32	3,487,664	N/A	N/A
MTL Baseline	31.0	26.2	27.7	23.0	42.0	22.1	135.43	1,630,664	N/A	7
Selective Sharing - Similarity	33.8	28.0	29.0	27.6	44.1	21.2	128.11	908,264	10	3

Table 5. Performance comparison between a baseline STL approach, a baseline MTL approach and three Selective Sharing strategies on the OmniArt dataset. For every attempt we list the final number of trainable parameters, which for the STL approach is the sum of all the trainable parameters for all tasks. Since the feature extraction is fixed, the parameter count contains only the task specific branches and the hard shared layer. Best overall performance is achieved by Selective Sharing with similarity.

Method	AA (acc. %)	GC (acc. %)	CCC (acc. %)	SC (acc. %)	MC (acc. %)	STC (acc. %)	CYE (MAE, years)	# Params	Lock Epoch	# Branches
STL Baseline	28.0	25.0	32.1	24.7	41.3	19.2	144.32	3,487,664	N/A	N/A
MTL Baseline	31.0	26.2	27.7	23.0	42.0	22.1	135.43	1,630,664	N/A	7
Selective Sharing - Similarity	33.8	28.0	29.0	27.6	44.1	21.2	128.11	908,264	10	3
Selective Sharing - Dissimilarity	39.40	21.10	28.10	23.82	42.86	19.01	137.05	727,664	7	2
Selective Sharing - Variance	43.73	26.85	31.20	26.72	40.43	22.25	129.79	727,664	7	2

Equations

$$G_{t}=\frac{\partial E_{t}}{\partial w_{ij}}=\sum_{k\in K_{t}}(\hat{y}_{k}-y_{k})\,g^{\prime}_{k}(x_{k})\,\frac{\partial}{\partial w_{ij}}x_{k},$$

$$G_{t}=\delta_{tj}a_{i},$$

$$G=\sum_{t=1}^{|T|}G_{t},$$

$$B_{t}=\sum_{i=1}^{d}D_{i}(\alpha_{i-1},f_{i},\alpha_{i}),$$

$$C_{t}=[B_{0}\;B_{1}\;B_{2}\;\dots\;B_{m}]^{T},$$

$$d_{M-k}(C_{t_{i}},C_{t_{j}})=\max\{core_{k}(C_{t_{i}}),\,core_{k}(C_{t_{j}}),\,d(C_{t_{i}},C_{t_{j}})\},$$

$$l_{rank}(x_{i},x_{j};f)=\max(0,\,1-(f(x_{i})-f(x_{j})))^{2},$$

Full text

Learning Task Relatedness in Multi-Task Learning for Images in Context

Gjorgji Strezoski, Nanne van Noord and Marcel Worring

University of Amsterdam, Amsterdam, Netherlands

Abstract.

Multimedia applications often require concurrent solutions to multiple tasks. These tasks hold clues to each other’s solutions; however, as these relations can be complex, this remains a rarely utilized property. When task relations are explicitly defined based on domain knowledge, multi-task learning (MTL) offers such concurrent solutions while exploiting relatedness between multiple tasks performed over the same dataset. In most cases, however, this relatedness is not explicitly defined and the domain expert knowledge that defines it is not available. To address this issue, we introduce Selective Sharing, a method that learns the inter-task relatedness from secondary latent features while the model trains. Using this insight, we can automatically group tasks and allow them to share knowledge in a mutually beneficial way. We support our method with experiments on 5 datasets in classification, regression, and ranking tasks and compare to strong baselines and state-of-the-art approaches, showing a consistent improvement in terms of accuracy and parameter counts. In addition, we perform an activation region analysis showing how Selective Sharing affects the learned representation.

1. Introduction

In many multimedia applications, there are not only multiple sources of information, but a multitude of tasks to perform as well. Whether segmenting sketches of animals (Sarvadevabhatla et al., 2017), recognizing birds by attribute appearance (Alami Mejjati et al., 2018), attributing artists to paintings (Strezoski and Worring, 2017a) or classifying facial features (Huang et al., 2018; Liu et al., 2018), state-of-the-art results (Alami Mejjati et al., 2018; Huang et al., 2018; Liu et al., 2018) show that exploiting task relatedness in a multi-task learning (MTL) setting benefits performance. For example, an artwork can be described by its creation period, style, technique, genre, type, or artist, as well as with content descriptors such as brush-stroke frequency, lighting direction or repeated texture patterns. Learning to recognize or predict each of these contextual attributes is a separate task, the solution of which may benefit the other tasks in the pool. This way, if one of the attributes is missing, or unknown, the knowledge from learning other contextual attributes, both visual and textual, can help in narrowing down the space of potential values for the missing attribute. In other words, if we recognize a painting has been made in the 17th century in the Dutch school of painting and contains a gray cloudy landscape, in an artist attribution task we could short-list a few artists as the possible creator.

The main benefit of MTL is leveraging multiple shared information sources, especially when a mutual dependence is present, to improve the process of solving multiple tasks at once (Li et al., 2017; Sarvadevabhatla et al., 2017; He et al., 2017). While theoretically the gain from MTL is enticing, fully exploiting its benefits is a tedious process in practice. Often this is due to the lack of expert knowledge about the data posing the MTL challenge. Knowing how to set up the shared representation in an MTL model can make a key difference in the model’s performance. However, a recurring theme in existing MTL approaches is that the task relatedness still needs to be specified before training, based on prior knowledge. Noticing task relatedness comes naturally to humans, but for MTL models to do the same a high level of domain expert knowledge is required, which is often expensive and rarely available. In this work we propose a method, Selective Sharing, for learning task relatedness during training and modifying the network architecture on-the-fly with no prior knowledge.

Selective Sharing makes it possible to optimize multiple tasks at once while sharing weights and parameters only between related tasks. Sharing occurs when their relatedness satisfies an influence metric derived from latent secondary features, such as gradient factors. As visible in Figure 1, we measure task similarity during training and cluster tasks with similar responses to the same stimuli. Upon satisfying the metric these tasks form groups and continue training together.

This approach iteratively reduces the size of the parameter space compared to other MTL approaches, without requiring a prior inter-task relationship specification. Grouping tasks allows them to share and fit a mutual parameter space. Once one of the task specific branches is no longer used, the trainable branch parameter count for that pair of tasks is halved. With a greedy starting position, the parameter reduction properties of Selective Sharing shorten training epoch duration and decrease memory requirements for the final model.

In this paper we elaborate on the following three contributions:

• We propose Selective Sharing, a new approach for MTL that performs data driven task grouping without requiring prior knowledge.

• We demonstrate our approach’s ability to make explicit the latent task relatedness in MTL supported with task specific activation region analyses.

• We improve or maintain performance in classification, regression, and ranking tasks while reducing the size of the parameter space.

2. Related Work

MTL has developed significantly since it was introduced (Caruana, 1997; Stein, 1956). While originally it was a ’one size fits all’ methodology, multiple MTL design patterns have been identified depending on the type of data that is being modeled, the type of sharing between the different tasks or the different levels on which a mutual representation is created. Additionally, with MTL comes a natural urge to simplify the models at hand and group the tasks that would benefit each other’s learning process. With this mutually beneficial task relationship in mind, there are numerous domains and modalities (Zamir et al., 2018; Ruder et al., 2017; Li et al., 2017; Sarvadevabhatla et al., 2017; He et al., 2017; Jou and Chang, 2016; Wang et al., 2016; Kaneko et al., 2016) where the MTL methodology can be applied. MTL is also often used implicitly, without a specific reference, in methods such as transfer learning and fine-tuning (Caruana, 1997; Sharif Razavian et al., 2014). Given this broad scope of MTL, in this section we focus on MTL in the context of systems that exploit visual information with their accompanying metadata.

2.1. Multi-task Learning

Whenever we find ourselves optimizing for more than one loss function, we are essentially doing MTL. This is in contrast to single task learning (STL) where we optimize only one loss function (Ruder, 2017). Caruana in 1997 also defined it as training tasks in parallel while using a shared representation (Caruana, 1997). While this is a simple definition, it does cover a vast portion of the possible problem space and allows for a lot of freedom in its interpretation.

In MTL there are two general ways in which we can define a problem and several more ways to apply an MTL approach with respect to the nature of the problem. If we are going to label an MTL problem by whether the tasks are sharing a label space or not, it can be either a Homogeneous MTL problem or a Heterogeneous MTL problem.

2.1.1. Homogeneous and Heterogeneous MTL

Homogeneous MTL problems are usually defined over existing multiclass classification problems that do not naturally consist of multiple tasks. This definition allows us to study the effects of MTL in a controlled environment, often competing against strong specialized STL benchmarks. Evgeniou and Pontil (Evgeniou and Pontil, 2004) demonstrate a homogeneous formulation by creating 139 binary tasks predicting whether a pupil belongs to a school or not. In a visual domain the same can be achieved with the USPS dataset (Hull, 1994), as Kumar and Daumé III show in (Kumar and Daumé III, 2012). Yang and Hospedales (Yang and Hospedales, 2016a) make use of the homogeneous formulation over the MNIST dataset (LeCun et al., 1998). In summary, with a homogeneous definition the target space is transformed to fit the MTL paradigm by splitting each target into a separate one vs. all task.

Heterogeneous MTL primarily occurs in problems and datasets that contain multiple types of labels existing in separate spaces for their data (Yang et al., 2009). For example, an artwork can have a label for the artist, the creation period, the type and the materials of the artwork. All of those labels represent different tasks, like artist attribution or creation period estimation (Strezoski and Worring, 2017a). Zhang et al. (Zhang et al., 2014) demonstrate the same for facial landmarks, and Zamir et al. (Zamir et al., 2018) show a heterogeneous formulation over multiple domains and target spaces at the same time.
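The one vs. all split described above can be sketched in a few lines of Python. This is only an illustration: the function name `one_vs_all_tasks` and the example labels are assumptions, not part of the paper.

```python
def one_vs_all_tasks(labels, num_classes):
    """Turn multiclass labels into one binary target list per class (task)."""
    return {
        c: [1 if y == c else 0 for y in labels]
        for c in range(num_classes)
    }


# Illustrative digit labels for five images (hypothetical data).
labels = [7, 1, 4, 7, 0]
tasks = one_vs_all_tasks(labels, num_classes=10)

# Task 7 answers the binary question "is this image a seven?".
print(tasks[7])  # [1, 0, 0, 1, 0]
```

Each sample is positive for exactly one of the resulting tasks, which is what makes the formulation homogeneous: all tasks share the same input and the same kind of label.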

Regardless of the problem formulation, following Caruana’s definition of MTL we know that a shared representation is key. But how do we decide what to share, to which extent and between which tasks should the sharing occur?

2.1.2. Hard and Soft Parameter Sharing

Most MTL approaches share the same base structure for feature extraction (Huang et al., 2018; Alami Mejjati et al., 2018; Zhang et al., 2014; Liu et al., 2015b; Zhang et al., 2016; Evgeniou and Pontil, 2004; Mrkšić et al., 2015; Kokkinos, 2017; Zamir et al., 2018; Liu et al., 2015a) and then continue to branch out, intertwine or widen the model’s parameter space. Sharing is an essential part of MTL and can be categorized as hard sharing or soft sharing.

Hard parameter sharing is the most commonly used approach for MTL in neural networks and goes back to (Caruana, 1998). It is generally applied by sharing the hidden layers between all tasks, while keeping several task-specific output layers. Hard parameter sharing greatly reduces the risk of over-fitting. In fact, Baxter (Baxter, 1997) showed that the risk of over-fitting the shared parameters is order N (where N is the number of tasks) smaller than over-fitting of the task-specific parameters, i.e. the output layers. This makes sense intuitively: the more tasks we learn simultaneously, the more our model has to find a general representation that is suited for all of the tasks and our chance of over-fitting on our original task is smaller.

In contrast, in soft parameter sharing each task has its own model with separate parameters. This means that there is no physical layer structure that is shared between the different tasks, but rather it is an intermediate set of parameters that is shared. In this case, sharing parameters acts as a weight regularizer, as Duong et al. show in (Duong et al., 2015). Further, Yang and Hospedales (Yang and Hospedales, 2016b) show that soft sharing regularization works on factorized representations (Tensor-Train (Oseledets, 2011) or Tucker (Tucker, 1966)) as well. This kind of sharing promotes task independence, but keeps the parameter space similar so that tasks can still influence each other. Similarly, Misra et al. (Misra et al., 2016) introduced Cross-Stitch Networks that rest on the soft sharing paradigm. Initially their approach starts with identical structures between tasks and soft parameter sharing. Different from (Yang and Hospedales, 2016b) and (Duong et al., 2015), in this model the sharing is determined by cross-stitch units, placed after pooling or fully connected layers, whose task is to learn a linear combination of the output of the previous layers from both structures. A shortcoming of these approaches is that they require knowledge of the parametric form of the tasks at hand. In (Alami Mejjati et al., 2018), Mejjati et al. treat their tasks as random variables for which statistical dependence can be measured and maximized. While it achieves comparable performance on several ranking and regression tasks, their approach relies on an existing dataset that mimics the distribution of the tasks. Consequently, for this method to obtain the reported performance, prior knowledge of the distributions for each task is always necessary. Inspired by (Alami Mejjati et al., 2018), in Selective Sharing we aim to obtain this knowledge on the fly, training the model in a greedy fashion and accumulating such information as it becomes available in the back-propagated gradients.

2.1.3. Task Grouping

Choosing which tasks share parts of their parameter space proves to be beneficial when prior knowledge about this entanglement is present. Long and Wang (Long and Wang, 2015) introduced the concept of Deep Relationship Networks, where they impose matrix priors on the shared fully connected layers, which allows the model to learn the task relationships. This is similar to Bayesian MTL models (Daumé III, 2009; Marquand et al., 2014), which essentially group the tasks based on this predefined relationship matrix. This approach to grouping still relies on a predefined sharing structure, which we avoid using Selective Sharing. Liu et al. (Liu et al., 2017) propose a dynamic greedy bottom-up approach to task grouping to overcome the predefined sharing matrix problem. Instead of hard-coding task dependencies, they dynamically widen the model by creating new branches as training progresses. However, this method has a risk of every task becoming a separate branch in the structure and therefore limits sharing and parameter tuning between tasks. Other approaches have tried to generalize existing approaches to MTL, like Yang and Hospedales (Yang and Hospedales, 2016a), who use tensor factorization to split the parameter space into shared and task specific parameters of every layer. Sluice networks, introduced by Ruder et al. (Ruder et al., 2017), aggregate multiple MTL approaches into one by creating a task hierarchy in MTL problems to maximize relatedness utilization regardless of how related the tasks actually are. Regardless of the scenario, task grouping methods tend to rely on prior task relationship knowledge, or extensive statistical analysis prior to model design.

In this paper we address some of the issues that arise with deep MTL models. Namely, we stray away from the predefined sharing structure and let the model decide the sharing parties and their structure. This allows for flexibility and dynamically adapting models that tune themselves to the tasks at hand. Learning the underlying task relatedness based on the supervision responses over a shared input allows for grouping tasks on the fly during training. The resulting network then provides additional insight (explicit task relations) into the data we are analyzing.

3. Selective Sharing

In any deep learning system, gradients flow from the final layers of the model towards the starting ones carrying a corrective signal for the weights and biases along the way. They are a way for the model to know where and how much it should correct its trainable parameters. Selective Sharing is based on the assumption that the identically constructed task specific estimators, sharing the same input and feature extraction platform, would manifest a correlation between the back-propagated gradients for related tasks. Following this, we define an MTL problem as:

• A set of tasks $T$ with cardinality $|T|$.

• A set of related tasks $R_{k}=T\setminus\{t_{k}\}$ for each task $t_{k}\in T$.

• A set of task dependent loss functions $E_{t}$ generating task specific gradients $G_{t}$ for the task specific targets $K_{t}$.

• A set of identical task specific estimators with identical layers $l_{t}$ and weights $w_{l_{t}}$ (see footnote 1).

Footnote 1: As all equations are task-specific, $t$ is omitted in further notation for simplicity and readability. Inputs, outputs, targets and activations follow the classic machine learning naming convention ($x,\hat{y},y,a$).

Because it does not require domain specific task relatedness knowledge, this approach allows any set of classification, regression or ranking problems to be formulated in the MTL paradigm, and can be used for discovering relations between contextual attributes in large bodies of data. For example, if we observe an MTL model trained for classifying written digits with one branch per digit, we can postulate that our optimization scheme will generate similar corrections for the branch specific weights in the branches related to classes ’1’, ’4’ and ’7’ (see Figure 5) when the input image is the number seven. Having a shared input and distinct gradient flows, we can study the gradients and their factors, which depict the behavior of task specific estimators and divulge information about inter-task relatedness and supervision signal similarity. In this way, we define three functional structures of our approach: the feature extraction block, the shared representation block and the task specific estimators (branches), as illustrated in Figure 2.

The feature extraction block provides the initial representation for the input data. Depending on the problem and whether it should be trained or not, the feature extraction block can range in form from a CNN to SIFT, HOG or any other handcrafted features that apply. The features are then propagated forward through the hard shared layer, which is the base representation Selective Sharing is working with. From this point, task specific estimators branch out with identical blocks (identical layers $l_{t}$ with identical weights $w_{l_{t}}$) per task $t$ with $K_{t}$ targets. Each branch has a task specific loss function, for example the Mean Squared Error $E=\frac{1}{2}\sum_{k\in K}(\hat{y}_{k}-y_{k})^{2}$, where $\hat{y}_{k}$ is the activation of output unit $k$ with input $x_{k}$ and $y_{k}$ is the ground-truth for the same unit. These losses produce the task specific gradients $G_{t}$.
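As a small worked example of the Mean Squared Error loss mentioned above, the following sketch evaluates $E$ and the per-unit error term $\hat{y}_{k}-y_{k}$ that is back-propagated. The function names and the sample values are illustrative assumptions.

```python
def mse_loss(y_hat, y):
    """E = 1/2 * sum_k (y_hat_k - y_k)^2 over the output units of one task."""
    return 0.5 * sum((p - t) ** 2 for p, t in zip(y_hat, y))


def mse_error_terms(y_hat, y):
    """dE/dy_hat_k = y_hat_k - y_k for every output unit k."""
    return [p - t for p, t in zip(y_hat, y)]


# Hypothetical predictions and targets for a task with three output units.
y_hat = [0.5, 0.0, 1.0]
y = [1.0, 0.0, 0.0]

print(mse_loss(y_hat, y))         # 0.625
print(mse_error_terms(y_hat, y))  # [-0.5, 0.0, 1.0]
```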

Our approach is applicable in both heterogeneous and homogeneous problems. For homogeneous MTL problems (MNIST, CIFAR, OmniGlot, Birds) we define each category as a binary classification task. Our model changes accordingly with duplication of the estimator for each task. In a heterogeneous (OmniArt) problem we optimize for multiple different estimators suitable for the task at hand.

The name Selective Sharing is derived from an important property of the method itself: it allows the model to select the tasks where sharing should occur, without being specifically programmed to do so. There is, however, a general directive for the sharing. The directive is given before the model training starts and is essentially the clustering condition. Depending on the goal the model can perform sharing by:

• Similarity maximizes the model’s exposure to data samples which have a high probability of sharing similar primitive features or high level semantics.

• Dissimilarity allows for an increase in information entropy which can prolong training times, but results in better generalization.

• Variance favors large task clusters so it strives towards learning a more complete representation.

3.1. Task specific gradient capture

The Selective Sharing pipeline starts by tapping into the gradient flow at the point where the model starts branching out for different tasks, immediately after the hard shared representation layer and task specific layer $l_{t}$. At that point we define the gradient as $G_{t}$ and compute it as a gradient for weights $w_{ij}\in l_{t}$ as follows (bias excluded for simplicity):

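$$G_{t}=\frac{\partial E_{t}}{\partial w_{ij}}=\sum_{k\in K_{t}}(\hat{y}_{k}-y_{k})\,g^{\prime}_{k}(x_{k})\,\frac{\partial}{\partial w_{ij}}x_{k},\tag{1}$$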

if we continue to decompose activation $x_{k}$ and define the gradient from the previous layer for task $t$ as $\delta_{tj}=g^{\prime}_{j}(x_{k})\sum_{k\in K_{t}}\delta_{k}w_{jk}$ , the gradient gets the final form of:

$$G_{t}=\delta_{tj}a_{i}.\tag{2}$$

Each $G_{t}$ tensor is a gradient from a separate task estimator. At the hard shared representation layer, $g_{j}$ is the activation function for node $j$ in layer $l_{t}$, $w_{ij}$ are the weights connecting node $j$ in layer $l_{t}$ with node $i$ in layer $l_{t}-1$, and $a_{i}$ is the activation (output) of node $i$ in layer $l_{t}-1$. Using the same notation, we define $\delta_{k}$ as all the terms that involve the value of a unit at index $k$ with respect to the expected target $K_{t}$, and $z_{k}$ as the input to node $k$ of layer $l_{t}$ for task $t$.

Since these captured gradients represent the data being clustered on basis of task relatedness, we can consider them as secondary features to the ones we learn for the data representations. Because the gradient capture is performed in a fully connected section between the hard shared representation and task specific branches, the gradient affecting the hard shared layer is:

$$G=\sum_{t=1}^{|T|}G_{t}.\tag{3}$$

When observing a gradient in a fully connected section of the model, there is an influence between each pair of input and output nodes. Despite their high dimensionality, we collect and label these gradient tensors making the resulting group tensor a clustering candidate and name it after the index of the task specific estimator (e.g. in our digit recognition model a gradient coming from estimator seven would have the label seven).
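The capture step can be illustrated with a minimal NumPy sketch. It assumes, purely for illustration, that every branch is a single sigmoid layer trained with the squared error loss, so the weight gradient of a branch is the outer product of the shared activations and the branch error terms. The names `branch_gradient`, `captured` and the task labels are hypothetical.

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def branch_gradient(a, W, y):
    """Gradient of E = 1/2 * sum_k (y_hat_k - y_k)^2 w.r.t. W for one branch.

    a: shared-layer activations, shape (n,)
    W: branch weights, shape (n, m), with y_hat = sigmoid(a @ W)
    Returns G of shape (n, m) with G[i, j] = delta_j * a_i.
    """
    y_hat = sigmoid(a @ W)
    delta = (y_hat - y) * y_hat * (1.0 - y_hat)  # error term per output unit
    return np.outer(a, delta)


rng = np.random.default_rng(0)
a = rng.normal(size=4)  # hypothetical hard shared representation
branches = {t: rng.normal(size=(4, 1)) for t in ("task_1", "task_7")}
targets = {"task_1": np.array([0.0]), "task_7": np.array([1.0])}

# Capture one labelled gradient per task specific estimator.
captured = {t: branch_gradient(a, W, targets[t]) for t, W in branches.items()}

# Sum of the per-task gradients.
G_total = sum(captured.values())
```

In a real pipeline these tensors would be collected for every batch in an epoch, which is why a compact representation becomes important.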

Keeping in mind that this capture is performed between each task branch and the hard shared representation layer, a rather compact representation with little information loss is necessary in order to perform the task correlation analysis. Additionally, this process spans hundreds of examples passing in an epoch, and the aggregated information loss can be substantial. For this reason we utilize the Tensor Train decomposition (Oseledets, 2011), which not only allows for a compact representation of tensors, but also eases the application of linear algebra operations.

3.2. Task specific gradient factorization

Depending on the dimensions (number of units) of the layers where they are calculated, gradients can become high dimensional tensors. This is particularly visible in MTL pipelines with multiple branches, as they all generate their own gradients. This type of rich representation is informative, but poses a problem in a setting where aggregation and quick analysis is required. Tensor train decomposition (Oseledets, 2011) is a mechanism that transforms a given tensor $G_{t}$ with $d$ elements $G_{t}(f_{1},f_{2},...,f_{d})$ into a more compact representation $B_{t}$ :

$$B_{t}=\sum_{i=1}^{d}D_{i}(\alpha_{i-1},f_{i},\alpha_{i}),\tag{4}$$

where each of the $D_{i}$ elements is a matrix representing a factor of the tensor $G_{t}$ and the $\alpha_{f}$ represent the index matrix of dimension (factor) $f$. To reconstruct the whole tensor, a multiplication needs to be performed between all of the factors, followed by a summation over the auxiliary $\alpha$ index nodes. In a reconstruction setting these operations are performed as many times as there are dimensions in the decomposed tensor. From them, branch specific matrices $C_{t}$ are produced with dimensions $[num\_batches\cdot shape(B_{t})]$, where $t$ is the index of the task.

$$C_{t}=[B_{0}\;B_{1}\;B_{2}\;\dots\;B_{m}]^{T}.\tag{5}$$

The $C_{t}$ from equation 5 are matrices which after being normalized continue to the clustering phase of Selective Sharing where the task groups are defined.
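To make the factorization step concrete, here is a minimal NumPy sketch of a Tensor Train decomposition computed with sequential truncated SVDs, together with the reconstruction by multiplying the factors and summing over the auxiliary indices. It is a generic TT-SVD illustration under assumed names (`tt_svd`, `tt_reconstruct`, `max_rank`) and not necessarily the exact routine used by the authors.

```python
import numpy as np


def tt_svd(tensor, max_rank):
    """Decompose `tensor` into TT cores of shape (r_prev, n_i, r_next)."""
    dims = tensor.shape
    cores, r_prev = [], 1
    mat = tensor
    for n in dims[:-1]:
        mat = mat.reshape(r_prev * n, -1)
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(s))          # truncate to the allowed rank
        cores.append(u[:, :r].reshape(r_prev, n, r))
        mat = s[:r, None] * vt[:r]         # remainder passed to the next step
        r_prev = r
    cores.append(mat.reshape(r_prev, dims[-1], 1))
    return cores


def tt_reconstruct(cores):
    """Multiply the cores and sum over the auxiliary (rank) indices."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=1)  # contract the shared rank index
    return out.reshape([c.shape[1] for c in cores])


rng = np.random.default_rng(0)
# A rank-1 stand-in for a captured gradient tensor (hypothetical data).
a, b, c = rng.normal(size=8), rng.normal(size=6), rng.normal(size=4)
G = np.einsum("i,j,k->ijk", a, b, c)

cores = tt_svd(G, max_rank=1)
compact_size = sum(core.size for core in cores)
print(compact_size, "stored values instead of", G.size)
```

For low-rank tensors such as this one the cores hold far fewer values than the full tensor while still reconstructing it; with a rank cap below the true rank the reconstruction becomes approximate.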

3.3. Factor clustering

Due to their high dimensionality, density and the critical time period in which clustering occurs, a robust fast high dimensional clustering approach is required. Robust to noise and efficient in handling dense representations, HDBSCAN (McInnes and Healy, 2017; Campello et al., 2013) is the clustering approach we utilize for our method. It uses a simple distance metric which allows for fast cluster formation and easy selection:

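$$d_{M-k}(C_{t_{i}},C_{t_{j}})=\max\{core_{k}(C_{t_{i}}),\,core_{k}(C_{t_{j}}),\,d(C_{t_{i}},C_{t_{j}})\},\tag{6}$$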

where $core_{k}$ is the distance to the $k$-th nearest neighboring factor and provides a sort of density estimation, and $d(C_{t_{i}},C_{t_{j}})$ is the original metric distance between $C_{t_{i}}$ and $C_{t_{j}}$ (see Figure 4). This distance metric is called mutual reachability distance. Under this metric, dense points (with low core distance) remain the same distance from each other, whereas sparser points are pushed away to be at least their core distance away from any other point. Its simple definition (eq. 6) allows for distance metric manipulations, so we can effectively alter the task group formation conditions when computing the $R_{t}$ sets of related tasks.

Clustering is performed over the stacked and normalized gradient factors per task at the end of each epoch. This process repeats until all of the tasks have formed a group with at least one other task, or there have been no changes to the model’s architecture for three consecutive epochs (see footnote 2). If no stop criterion exists, then as training progresses, errors get smaller and gradients sparser, so all tasks are expected to form one group. This reverts to a hard shared MTL approach and defeats the purpose of grouping in the first place.

Footnote 2: After experimenting with two, three, four and five epochs we empirically determined the number three as most suitable. As clustering is expensive, performing it in states where no group formation is possible is redundant.
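The mutual reachability distance can be computed directly from its definition. The sketch below uses plain Python and Euclidean distance on four hypothetical 2-D points; note that whether the $k$-th neighbour counts the point itself varies between implementations, and here the point itself is excluded as an assumption.

```python
import math


def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def core_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbour (itself excluded)."""
    dists = sorted(euclidean(points[i], p) for j, p in enumerate(points) if j != i)
    return dists[k - 1]


def mutual_reachability(points, i, j, k):
    """max{core_k(p_i), core_k(p_j), d(p_i, p_j)}."""
    return max(
        core_distance(points, i, k),
        core_distance(points, j, k),
        euclidean(points[i], points[j]),
    )


# Three points in a dense region and one isolated point (hypothetical factors).
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 0.0)]

print(euclidean(points[2], points[3]))               # 9.0
print(mutual_reachability(points, 2, 3, k=2))        # 10.0
```

The isolated point has a core distance of 10, so its distance to the point at (1, 0) is raised from 9 to 10, while distances between points in the dense region are only bounded below by their small core distances.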
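The stopping rule for the clustering phase can be expressed as a small predicate. This is a sketch under assumptions: `groups` is taken to be a list of task-id sets (one per current branch), and `epochs_without_change` a counter maintained by the training loop; both names are hypothetical.

```python
def should_stop_clustering(groups, epochs_without_change, patience=3):
    """Return True when clustering should no longer be run.

    groups: list of sets of task ids, one set per current branch.
    epochs_without_change: consecutive epochs with an unchanged architecture.
    patience: number of unchanged epochs tolerated (three in the text above).
    """
    all_grouped = all(len(group) >= 2 for group in groups)
    return all_grouped or epochs_without_change >= patience


# Hypothetical grouping of seven tasks; the last task is still on its own.
groups = [{"AA", "SC"}, {"GC", "MC", "STC", "CCC"}, {"CYE"}]

print(should_stop_clustering(groups, epochs_without_change=1))  # False
print(should_stop_clustering(groups, epochs_without_change=3))  # True
```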

3.4. Branch merging and recalculation

When tasks form a group in the clustering phase, it implies that they have similar reactions to the same input, so the weights of the layers in their branches should be in a similar state at that point in time. The starting condition of all branches having identical initial states supports this hypothesis. With identical architectures per branch, merging can be performed by regular arithmetic operations, such as keeping the pairwise maximum, minimum or mean weights, or by simply keeping the branch whose aggregated loss was the lowest in the newly formed group. For the experimental design we keep the branch with the lowest aggregated loss, with its weights and parameters intact.
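The merging options listed above can be sketched as follows. Branch weights are represented as flat lists for simplicity; the function name `merge_branches`, the strategy strings and the example values are illustrative assumptions.

```python
def merge_branches(branches, losses, group, strategy="lowest_loss"):
    """Merge the branches of the tasks in `group` into a single weight list.

    branches: task id -> flat list of branch weights (identical lengths).
    losses:   task id -> aggregated loss of that branch.
    """
    members = sorted(group)
    if strategy == "lowest_loss":
        # Keep the best branch with its weights intact.
        keep = min(members, key=lambda t: losses[t])
        return list(branches[keep])
    columns = list(zip(*(branches[t] for t in members)))
    if strategy == "mean":
        return [sum(col) / len(members) for col in columns]
    if strategy == "max":
        return [max(col) for col in columns]
    if strategy == "min":
        return [min(col) for col in columns]
    raise ValueError(f"unknown strategy: {strategy}")


# Two hypothetical branches that were clustered into the same group.
branches = {"AA": [0.2, -0.4, 1.0], "SC": [0.4, 0.0, 0.6]}
losses = {"AA": 0.8, "SC": 0.5}

merged = merge_branches(branches, losses, {"AA", "SC"})
print(merged)  # [0.4, 0.0, 0.6]
```

Keeping the lowest-loss branch avoids averaging weights that may sit in different local configurations, at the cost of discarding what the other branch has learned.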

4. Experimental Design

We evaluate Selective Sharing on several problems, both homogeneous and heterogeneous ones. By performing various classification, regression and ranking tasks on five datasets we aim to:

• Illustrate the reasoning behind the method and the group formation logic both visually and intuitively on simple datasets (MNIST and CIFAR10)

• Evaluate group formation handling when optimizing for a very large number of tasks (50) (OmniGlot)

• Evaluate performance in ranking problems and compare against classic and current MTL approaches (UCSD-Birds, OmniGlot)

• Demonstrate performance in a real large-scale multi-modal MTL problem (OmniArt)

4.1. Datasets and Tasks

The MNIST (LeCun et al., 1998) handwritten digit classification problem and CIFAR10 (Krizhevsky, 2009) are well established benchmarks containing 70K and 60K images. For both experiments we split the data into 50K (MNIST) and 40K (CIFAR10) train, 10K validation and 10K test images spanning 10 target classes. In the CIFAR10 experiment we normalized the images with the training dataset mean. For MNIST we did not use any augmentation or preprocessing steps. As setup in a homogeneous setting we treat the ten target classes as ten binary classification tasks with the standard train/test splits.

OmniGlot (Lake et al., 2015) contains 1,623 handwritten characters from 50 alphabets. Each character is drawn in a square region, which we resized to 28x28 pixels. The model is tasked with classifying the alphabet of origin of the input character, resulting in a 50 task MTL problem.

The Caltech-UCSD Birds dataset (Wah et al., 2011) provides 11,788 bird images over 200 bird species with 312 binary attribute annotations. For state of the art comparison, we compare on ten target attributes obtained with spectral clustering using the FSIC as the similarity measure (Alami Mejjati et al., 2018). For the ten selected attributes, 10 MTL ranking tasks are defined. For each target attribute (ranking task) we rank pairs of images based on the estimated presence of that attribute on 10% of the dataset. Of the remaining data, 80% of the dataset is used for training and 10% for validation.

OmniArt (Strezoski and Worring, 2018) provides the data for a comprehensive real large-scale multi-modal MTL problem. This artistic dataset features over 2M data samples and presents a natural heterogeneous MTL problem described by multiple interconnected contextual attributes, making it ideal for testing our method. For the purpose of this experiment we select a subset of OmniArt containing only artworks of the type painting. Having a consistent artwork type allows us to better illustrate the connection between the tasks. Our selected subset consists of 133K paintings, for which we use an 80-10-10 split for training, validation and testing. We compare against an MTL baseline introduced with the dataset itself (Strezoski and Worring, 2017a).

4.2. Model setup and Training

For MNIST we define a two layer CNN with 20 5x5 filters in the first layer, and 40 filters of size 5x5 in the second. The convolutions are batch normalized and followed by max pooling layers with stride two. As a feature cutoff point, we apply a 640 unit fully connected (FC) layer. For STL this layer is followed by a sequence of 100 unit and 50 unit FC layers ending in a ten unit softmax output layer (Ciresan et al., 2011). For MTL, the 640 unit hard shared representation is followed by the same classification block from the STL setup, duplicated ten times. We replace the ten-way softmax output with one sigmoid unit per block that performs binary classification. As the loss function we use binary cross-entropy for binary classification and categorical cross-entropy in the STL approaches.
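The layer sizes above can be checked with some simple arithmetic. The sketch below assumes unpadded convolutions with stride one, a 2x2 pooling window and bias terms in every FC layer; none of these details are stated in the text, so the resulting numbers are indicative only.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

size = 28                                  # MNIST input is 28x28
size = conv_out(size, kernel=5)            # conv 1 (20 filters): 24
size = conv_out(size, kernel=2, stride=2)  # max pooling:         12
size = conv_out(size, kernel=5)            # conv 2 (40 filters): 8
size = conv_out(size, kernel=2, stride=2)  # max pooling:         4
flat_features = 40 * size * size           # 40 feature maps of 4x4

def fc_params(n_in, n_out):
    """Weights plus biases of a fully connected layer (bias assumed)."""
    return n_in * n_out + n_out

# One task specific branch: 640 -> 100 -> 50 -> 1 sigmoid unit
branch_params = fc_params(640, 100) + fc_params(100, 50) + fc_params(50, 1)
all_branches = 10 * branch_params          # ten duplicated MTL branches
```

Under these assumptions the flattened feature map has 640 values, and every branch that is merged away removes `branch_params` trainable parameters from the model.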

For CIFAR10 we train a VGG-16 network from scratch with an average pooling cutoff at the end. For STL, the average pooling is followed by a 1024 dimensional FC layer and two 100 dimensional FC layers, with a 10-way softmax at the end. For the MTL experiment we rely on the 1024 dimensional FC layer as the hard shared layer and define 10 task specific branches with two 100 unit FC layers and a sigmoid unit for binary classification.

For Caltech-UCSD Birds 200 (CUB-200), the tasks are posed as ten ranking problems, each ranking pairs of images according to the predicted presence of the attribute in the input images. A pretrained VGG-16 network constitutes the feature extraction unit, followed by a 2048 dimensional FC layer as the hard shared representation. Our 10 ranking blocks have two FC layers of 512 and 256 units followed by an output unit. We train our ranking blocks with a paired ranking loss function:

[TABLE]
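As a generic illustration of what a pairwise ranking objective looks like, the sketch below uses a margin-based hinge form. This form, the `margin` parameter and the function name are assumptions made for illustration and are not necessarily the loss used in these experiments.

```python
def pairwise_hinge_loss(score_a, score_b, a_ranks_higher, margin=1.0):
    """Generic margin-based loss for one image pair (illustrative only).

    score_a, score_b: predicted attribute presence for the two images
    a_ranks_higher:   True if image a should be ranked above image b
    """
    sign = 1.0 if a_ranks_higher else -1.0
    # Zero loss once the correctly ordered scores differ by at least `margin`.
    return max(0.0, margin - sign * (score_a - score_b))

correct_order = pairwise_hinge_loss(2.0, 0.5, a_ranks_higher=True)
wrong_order = pairwise_hinge_loss(0.5, 2.0, a_ranks_higher=True)
```

A correctly ordered pair with a sufficient score gap contributes no loss, while a wrongly ordered pair is penalized in proportion to the gap.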

For OmniGlot we train the same three layer CNN as (Yang and Hospedales, 2016a), but with a different classification setup and input dimensions (resized to 28x28px): we define 50 tasks and do not make a character level distinction. For the STL experiments we define a FC layer of 300 units, two FC layers of 100 units with 0.2 dropout and a 50 class softmax layer at the end. For the MTL experiments we rely on the same 300 unit hard shared FC layer and define 50 task specific estimators. The task estimators have the STL estimator design, but perform binary classification at the end.

For MNIST, CIFAR, OmniGlot and UCSD-Birds we optimize with Stochastic Gradient Descent with momentum. The MNIST, CIFAR and OmniGlot experiments start with a batch size of 64, learning rate of 0.02 and momentum of 0.5 (Ciresan et al., 2011). The UCSD-Birds experiment uses a batch size of 32 and a learning rate of 0.01 (Evgeniou and Pontil, 2004). All four experiments ran for 30 epochs and testing is done with the best performing validation model.

In the OmniArt setup, it is important that we maintain consistent features over all experiments as it is a complex dataset. For this we trained a deep variational autoencoder (Hou et al., [n. d.]; Kingma and Welling, 2013) on the training set, which we continue to use as a feature extractor for all STL and MTL experiments. As the estimator, for both the STL and MTL experiments we use the same structure after the hard shared representation layer of 2048 units. The estimators consist of two 300 unit FC layers, 0.2 dropout and softmax for classification or sigmoid for regression outputs. For the STL experiments we train this structure independently for all seven tasks. In the MTL experiments, it comprises the task specific estimators (branches). We define seven tasks of interest, namely artist attribution (AA), genre classification (GC), creation century classification (CCC), school classification (SC), medium classification (MC), style classification (STC) and creation year estimation (CYE). Different from the MNIST, CIFAR10 and OmniGlot datasets, our tasks are not binary classification tasks. Each one has a different number and type of targets.

For the OmniArt experiments we use the Adam optimizer (Kingma and Ba, 2014) with learning rate of 0.001, batch size of 100 and train for 50 epochs. Unless specified, all nonterminal FC units use the ReLU (Nair and Hinton, 2010) activation function. Testing is done with the best performing validation models.
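The optimizer settings stated in this section can be collected into a single configuration. The dictionary layout and key names below are only an illustrative choice; the values are those given in the text, and `None` marks a value the text does not state or that does not apply.

```python
SGD = "sgd_with_momentum"

TRAINING_CONFIG = {
    "MNIST":      {"optimizer": SGD, "batch_size": 64, "lr": 0.02, "momentum": 0.5, "epochs": 30},
    "CIFAR10":    {"optimizer": SGD, "batch_size": 64, "lr": 0.02, "momentum": 0.5, "epochs": 30},
    "OmniGlot":   {"optimizer": SGD, "batch_size": 64, "lr": 0.02, "momentum": 0.5, "epochs": 30},
    # Momentum value for UCSD-Birds is not stated in the text.
    "UCSD-Birds": {"optimizer": SGD, "batch_size": 32, "lr": 0.01, "momentum": None, "epochs": 30},
    # Adam has no single momentum setting in this sense.
    "OmniArt":    {"optimizer": "adam", "batch_size": 100, "lr": 0.001, "momentum": None, "epochs": 50},
}

sgd_datasets = [name for name, cfg in TRAINING_CONFIG.items() if cfg["optimizer"] == SGD]
```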

4.3. Results and Discussion

Our experimental design explores multiple sharing paradigms for the five datasets. The reported results show the performance of Selective Sharing with Similarity, the best performing approach. Figure 5 shows the formed groups, for which performance is shown in Tables 1 and 3. Group formations, results with all sharing criteria and training details are available in the supplementary material.

Sharing with similarity yields an accuracy of 99.0% for MNIST. For CIFAR10, sharing with similarity produced an average per class (per task) accuracy of 92.7%, improving upon both the STL and MTL baseline approaches. Selective Sharing with similarity (expectedly) yields the best results (MNIST, CIFAR and OmniGlot), partly because similar letter and number types share the same constructional primitives (see Figure 6). In that way, learning similar alphabets together is expected to improve the generalization capabilities of the model. A superior feature extraction method, data augmentation and parameter tuning could improve the performance in both tasks, but this is beyond the scope of the experiments.

For the UCSD-Birds homogeneous ranking problem we compare against a vanilla MTL model, two strong baselines (Evgeniou and Pontil, 2004; Lee et al., 2016) and one state-of-the-art approach (Alami Mejjati et al., 2018) in ranking the images containing ten selected bird attributes. Selective Sharing with similarity outperforms all other approaches in seven out of ten tasks.

Sharing with similarity produced the best results for the OmniArt dataset overall, while sharing with between-cluster variance mostly boosted performance for the AA and STC tasks. The consistent improvement of sharing with similarity over the MTL baseline can be observed in Table 3. For example, the Dutch school of painting was active from the 1400s to the late 1500s. Most of the artists from this period in our OmniArt subset are Dutch masters, which implies that knowing the period and school narrows down the list of possible artists significantly. We could have easily discovered this relation by mining the data itself, but the quadruple correlation in sharing with variance between artists, creation periods, schools and styles is quite intricate and model driven.

A qualitative observation from the activation region analysis is that the representations learned with selective sharing are more general and feathered than their STL or MTL counterparts. In addition, as visible in the second rows of Figure 6, besides the larger activation span we can observe a similarity between the class activation regions for tasks that formed groups. For example, the group formed between ’2’, ’3’ and ’5’ in the MNIST tasks exhibits a consistent activation on the curve of the digit in all three tasks. A similar activation can be observed in the below neckline and shoulder area in the Artist, Century, School group in the OmniArt analysis.

5. Conclusions

Selective Sharing is a multi-task learning approach for images in context that allows exploiting task relatedness for group formation without any predefined intertask dependency. Group formation is performed using conditional statements on distances between clusters of task specific gradient factors. This grouping procedure implicitly reduces the trainable parameter space dimensionality and boosts predictive performance for related tasks (or contextual attributes) as a result (see Tables 1, 2, 3). Moreover, Selective Sharing is modality invariant, making it applicable in every multimedia scenario where a supervision signal and multiple targets are available. As such, applied on a wide range of MTL problems, Selective Sharing adjusts the representation learning process to implicitly exploit the entanglement between the contextual attributes and supports learning robust shared representations.

On a general note, experimental results on images in context show that forming groups of tasks over which we build a mutual representation is beneficial to the overall learning process. Figure 6 illustrates that the features we obtain are more robust and general in comparison to STL and MTL baselines. Furthermore, the way a model defines its groups affects the performance in specific ways. For example, sharing with similarity can improve tightly interconnected tasks, but sharing with variance can improve the overall performance, as the latent data representation is built with respect to more contextual attributes that correlate and hold exclusive data insight. Compared to conventional matrix driven MTL approaches, Selective Sharing provides an additional degree of freedom in modeling the tasks at hand, as between-task relations need not be studied prior to the classification architecture design. Moreover, the intuition and logic behind the approach are simple and easy to grasp without the need for complex mathematical or statistical apparatus, making Selective Sharing a versatile and useful approach to MTL. In conclusion, sharing is a virtue and knowing who to share with is awareness. We attempt to teach our models how to do both.

Acknowledgements.

The authors would like to thank Pascal Mettes for his feedback. This research is supported by the VISTORY project NWO award number 628.007.004.

Appendix A Supplemental Material

In this supplemental material we present an expansion of the experimental design and the results obtained from it. We reproduce some content of the main paper to make this supplemental more self-contained.

A.1. Architectural Details

This section contains a graphical overview of the defined models for easy understanding and reproduction of the experimental pipeline, as well as a tabular hyper-parameter definition. On the left side of Figures 7, 8, 9, 10 and 11 we depict the STL model, and on the right its MTL counterpart. The number between the task specific estimators is the number of times the estimator is duplicated, and each block is color coded by its functionality:

• Yellow - BatchNormalization
• Orange - A complete feature extraction block or known architecture
• Dark Orange - DropOut
• Red - 2D Max Pooling
• Blue - 2D Convolutions
• Green - Shared Layers (Linear Fully Connected)
• Purple - Task Specific Linear Fully Connected Layers
• Grey - Task specific output layer

In Table 4 we show the complete hyperparameter and data setup for the defined STL and MTL experiments. All experiments on MNIST, CIFAR10, OmniGlot, OmniArt and Birds were run on a single nVidia Titan-X GPU with PyTorch version 0.3.1 on CUDA 7.5.

A.2. Extended Set of Results

In this section we discuss the obtained results with multiple sharing conditions, the effects Selective Sharing has on training duration, the reason for the early stopping mechanism and the possibility of further refining the clustering process. Our expanded result set on multiple datasets provides insight into how the sharing conditionals affect performance on different data types.

While in the main paper we report only the best obtained results (from Sharing with Similarity), we performed analysis on Selective Sharing with 4 sharing conditionals:

• Sharing with similarity
• Sharing with dissimilarity
• Sharing with variance
• Sharing with random assignment

Each of these conditionals has a different effect on the learning process and produces different representations based on the groups of tasks being learned.

In the case of sharing with similarity, a visual inspection of the input images shows that there is a correlation between the artist, the creation century and their school of origin. E.g. the works of Rembrandt and Modigliani in the OmniArt dataset are oil on canvas in almost 80% of the cases, which implies that knowing the artist narrows down the list of mediums significantly and vice versa. A similar conclusion can be drawn from sharing with similarity in the MNIST, OmniGlot and CIFAR-10 datasets. In MNIST the group containing ’8’, ’6’ and ’9’ shares the closed circle in each digit, while the group containing ’2’, ’3’ and ’5’ shares the open loop at the bottom of ’5’, the top of ’2’ and the mid-section of ’3’.

For sharing with between-cluster variance, the results show that for OmniArt it is better to build a representation with respect to a larger number of tasks rather than one formed from smaller task groups. This phenomenon can be attributed to the tasks initially having a high correlation. E.g. individual artists have contributed to a limited number of styles and periods, on chosen mediums, from their own school of painting, which encapsulates their entire descriptive metadata. It is also intuitive that sharing with variance ensures the learned representation will encompass as much relevant information as possible.

Sharing with dissimilarity is motivated by the idea that increased information entropy in the feedback signal would produce more robust and general features. Opposite to sharing with similarity, we train a smaller, more general representation for the group of dissimilar tasks. If we observe the group formed by ’Frogs’ and ’Airplanes’ in the CIFAR-10 experiment, this branch of the network would have learned a more general representation than the ’Boat’ and ’Airplane’ group. Just being exposed to images with different contexts should, in theory, improve the overall generalization properties of the branch.

We use random sharing as a control against which we compare performance. It uses randomly generated sequences to select clusters and then checks whether a task is duplicated (duplicates are removed). A consistent improvement over this method in all cases further confirms the beneficial effects of selective sharing in MTL models.
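One possible reading of this random control is sketched below. The text does not specify how the random sequences are generated, so the parameters `num_groups`, `group_size` and `seed`, as well as the rule of dropping a task once it has appeared in an earlier group, are assumptions.

```python
import random

def random_groups(tasks, num_groups, group_size, seed=0):
    """Randomly assign tasks to groups, removing duplicated tasks."""
    rng = random.Random(seed)
    seen = set()
    groups = []
    for _ in range(num_groups):
        # Randomly generated sequence of candidate tasks for this group.
        sequence = [rng.choice(tasks) for _ in range(group_size)]
        group = []
        for task in sequence:
            if task not in seen:   # drop duplicates within and across groups
                seen.add(task)
                group.append(task)
        if group:
            groups.append(group)
    return groups

tasks = [f"digit_{d}" for d in range(10)]
groups = random_groups(tasks, num_groups=4, group_size=3)
```

Because the grouping ignores the gradient factors entirely, any gain of a selective condition over this control can be attributed to the grouping criterion itself.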

The performance threshold can be either a user-defined or a trainable parameter. A user-defined threshold would imply that prior knowledge has been introduced to the system, which means Selective Sharing is no longer a no-prior-knowledge MTL method. However, there are ways to monitor and learn this threshold by following individual and group task performances during training.

In our case, our empirical study shows that it is best to start applying Selective Sharing in the first epochs after the model starts learning. At this point in time the gradients are more stable and consistent, which acts as a regularizer for the extracted factors and improves our clustering performance. If Selective Sharing is applied very late in the training process, the clustering mechanism should be much more sensitive to compensate for the minute differences in the gradient factors.

A.3. Training duration with Selective Sharing

Selective Sharing involves additional steps during the training procedure, namely the factorization and clustering processes. While at the start of the training process these steps increase the duration of an epoch, epoch duration decreases multiple times as the parameter space of the model shrinks. After the early stopping mechanism takes over and a final model architecture is defined, epoch duration stabilizes and is always lower than that of conventional MTL approaches.

Compared to a single STL model performing one task over a 10-way softmax, MTL will always be slower, and so will Selective Sharing.

Bibliography

Alami Mejjati et al. (2018) Youssef Alami Mejjati, Darren Cosker, and Kwang In Kim. 2018. Multi-task Learning by Maximizing Statistical Dependence. In Proceedings of CVPR.
Baxter (1997) Jonathan Baxter. 1997. A Bayesian/Information Theoretic Model of Learning to Learn via Multiple Task Sampling. Machine Learning 28, 1 (01 Jul 1997), 7–39.
Campello et al. (2013) Ricardo JGB Campello, Davoud Moulavi, and Jörg Sander. 2013. Density-based clustering based on hierarchical density estimates. In Pacific-Asia conference on knowledge discovery and data mining. Springer, 160–172.
Caruana (1997) Rich Caruana. 1997. Multitask Learning. Machine Learning 28, 1 (01 Jul 1997), 41–75. https://doi.org/10.1023/A:1007379606734
Caruana (1998) Rich Caruana. 1998. Multitask learning. In Learning to learn. Springer, 95–133.
Ciresan et al. (2011) Dan Claudiu Ciresan, Ueli Meier, Luca Maria Gambardella, and Jurgen Schmidhuber. 2011. Convolutional neural network committees for handwritten character classification. In Document Analysis and Recognition (ICDAR), 2011 International Conference on. IEEE, 1135–1139.
Daumé III (2009) Hal Daumé III. 2009. Bayesian multitask learning with latent hierarchies. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 135–142.