Visual pathways from the perspective of cost functions and multi-task deep neural networks

H. Steven Scholte; Max M. Losch; Kandan Ramakrishnan; Edward H.F. de Haan; Sander M. Bohte

arXiv:1706.01757·q-bio.NC·October 16, 2017


TL;DR

This paper proposes a computational framework using multi-task deep neural networks to understand the functional organization of the visual pathways, highlighting how task relatedness influences feature sharing and pathway specialization.

Contribution

It introduces a novel method to measure unit contributions to tasks in multi-task networks and applies it to analyze visual pathway organization based on task relatedness.

Findings

1. Unrelated tasks show decreasing feature sharing in higher layers.

2. Related tasks maintain high feature sharing across layers.

3. Method can potentially analyze biological visual system organization.


Tables

Table 1: Classification errors. Comparison of the error rates of RelNN and UnrelNN on a validation set of 11,800 images. The Top-5 error is the fraction of images for which the correct label is not among the 5 most likely predictions. Both models were trained for 90 epochs until convergence with Nesterov accelerated gradient descent (Nesterov, 1983) with a momentum of 0.9, starting with a learning rate of 0.01 and decreasing it every 30 epochs by a factor of 10.

| Top-5 error | Subordinate-level recognition | Ordinate-level/Text recognition |
|-------------|-------------------------------|---------------------------------|
| Chance      | 97.9%                         | 66.7%                           |
| RelNN       | 14.0%                         | 2.9%                            |
| UnrelNN     | 15.2%                         | 4.9%                            |
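The chance row of Table 1 can be checked with a short calculation. The sketch below assumes a guesser that picks 5 distinct classes uniformly at random, and takes the class counts (234 subordinate classes, 15 ordinate or text classes) from the training setup; the function name is hypothetical.

```python
def chance_top_k_error(num_classes: int, k: int = 5) -> float:
    """Error rate of a uniform random guesser that is allowed k distinct guesses."""
    return 1.0 - min(k, num_classes) / num_classes

# Class counts from the training setup: 234 subordinate, 15 ordinate/text classes.
for name, n in [("subordinate", 234), ("ordinate/text", 15)]:
    print(f"{name}: {100 * chance_top_k_error(n):.1f}%")
```

With these class counts the two values round to 97.9% and 66.7%, matching the table.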

Equations

p(y \mid x, \Theta_{\setminus\theta}) = \sum_{\theta} p(y \mid x, \Theta)\, p(\theta)

p(y \mid x, \Theta_{\setminus\theta}) = \frac{\int_{\theta} p(y \mid x, \Theta)\, p(x, \Theta_{\setminus\theta})\, p(\theta)\, d\theta}{p(x, \Theta_{\setminus\theta}) \int_{\theta} p(\theta)\, d\theta} = \int_{\theta} p(y \mid x, \Theta)\, p(\theta)\, d\theta

Inline expressions: \theta; p(\theta); \mathrm{odds}(z); WE_{\theta}(y \mid x, \Theta); -\log_{2}(\mathrm{odds}(y \mid x, \Theta_{\setminus\theta})); C_{\theta}(y \mid \Theta)

TC_{\theta}(t \mid \Theta) = \frac{1}{K} \sum_{k=1}^{K} C_{\theta}(y_{k} \mid \Theta), \quad t \in \mathrm{Tasks}, \; K \in |\mathrm{outputs}_{t}|
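The quantities listed above can be combined into a small numerical sketch. Several parts are assumptions that this excerpt does not spell out: odds(z) = z / (1 - z) is the usual definition, the weight of evidence is taken as the difference of log2-odds with and without the unit, and the per-output contribution C is taken as the absolute weight of evidence. Only the final averaging step, TC as the mean of C over a task's K outputs, is read directly from the formula.

```python
import math

def odds(p: float, eps: float = 1e-9) -> float:
    """Assumed definition odds(z) = z / (1 - z), clipped for numerical stability."""
    p = min(max(p, eps), 1.0 - eps)
    return p / (1.0 - p)

def weight_of_evidence(p_full: float, p_marginal: float) -> float:
    """Assumed form: log2-odds with the unit minus log2-odds with it marginalized out."""
    return math.log2(odds(p_full)) - math.log2(odds(p_marginal))

def task_contribution(per_output_contributions):
    """TC = (1/K) * sum_k C(y_k): mean contribution over the K outputs of one task."""
    k = len(per_output_contributions)
    return sum(per_output_contributions) / k

# Hypothetical probabilities for a task with K = 3 outputs.
p_full = [0.70, 0.20, 0.10]   # p(y_k | x, Theta)
p_marg = [0.50, 0.30, 0.20]   # p(y_k | x, Theta without the unit)

# Assumption of this sketch: C is the absolute weight of evidence.
contribs = [abs(weight_of_evidence(a, b)) for a, b in zip(p_full, p_marg)]
tc = task_contribution(contribs)
print(f"task contribution: {tc:.3f}")
```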


Code & Models

Repositories

mlosch/FeatureSharing (PyTorch, official)


Full text

Visual pathways from the perspective of cost functions and multi-task deep neural networks

H. Steven Scholte (1,2,∗), Max M. Losch (1,2,3,∗), Kandan Ramakrishnan (3), Edward H.F. de Haan (1,2), Sander M. Bohte (4)

∗ Shared first author

1Department of Psychology, University of Amsterdam, The Netherlands

2Amsterdam Brain and Cognition, University of Amsterdam, The Netherlands

3Informatics Institute, University of Amsterdam, The Netherlands

4Machine Learning Group, CWI, Amsterdam, The Netherlands

{h.s.scholte, m.m.losch, k.ramakrishnan, e.h.f.dehaan}@uva.nl, [email protected]

Abstract

Vision research has been shaped by the seminal insight that we can understand the higher-tier visual cortex from the perspective of multiple functional pathways with different goals. In this paper, we try to give a computational account of the functional organization of this system by reasoning from the perspective of multi-task deep neural networks. Machine learning has shown that tasks become easier to solve when they are decomposed into subtasks with their own cost function. We hypothesize that the visual system optimizes multiple cost functions of unrelated tasks and that this causes the emergence of a ventral pathway dedicated to vision for perception, and a dorsal pathway dedicated to vision for action. To evaluate the functional organization in multi-task deep neural networks, we propose a method that measures the contribution of a unit towards each task, applying it to two networks that have been trained on either two related or two unrelated tasks, using an identical stimulus set. Results show that the network trained on the unrelated tasks shows a decreasing degree of feature representation sharing towards higher-tier layers, while the network trained on related tasks shows a uniformly high degree of sharing. We conjecture that the method we propose can be used to analyze the anatomical and functional organization of the visual system and beyond. We predict that the degree to which tasks are related is a good descriptor of the degree to which they share downstream cortical units.

1 Introduction

The visual system is described as consisting of two parallel pathways. Research by Gross, Mishkin and colleagues, integrating insights from lesion (Newcombe, 1969) and anatomical studies (Schneider, 1969), showed that these pathways emerge beyond the striate cortex, with one involved in the identification of objects, projecting ventrally, and the other involved in the localization of objects, projecting to the parietal cortex (Gross & Mishkin, 1977; Mishkin et al., 1983). From the start of the dual-pathway theory, multiple pathways were believed to be computationally efficient (Gross & Mishkin, 1977). Support for this idea comes from research using artificial networks with one hidden layer, showing that location and identity are better learned when units in the hidden layer are uniquely assigned to one of these functions (Rueckl et al., 1989; Jacobs et al., 1991).

In the early nineties, Goodale & Milner argued, on the basis of neuropsychological, electrophysiological and behavioural evidence, that these pathways should be understood as having different goals. The ventral pathway (“vision for perception”) is involved in computing the transformations necessary for the identification and recognition of objects. The dorsal pathway (“vision for action”) is involved in sensorimotor transformations for visually guided actions directed at these objects (Goodale & Milner, 1992).

It was recently suggested that the brain uses a variety of cost functions for learning (Marblestone et al., 2016). These cost functions can be highly diverse. The brain must optimize a wide range of cost functions, such as keeping body temperature constant or optimizing future reward from social interactions. High-level cost functions, by necessity, also shape other cost functions that determine the organization of perception: a cost function that is being optimized to minimize hunger affects the visual recognition cost function, as foods have to be recognized. Mechanistically, this could take place directly through, for instance, a reward modulation of object recognition learning, or indirectly through evolutionary pressure on the cost function associated with object recognition learning. In this paper, we try to understand how multiple pathways in the visual cortex might evolve from the perspective of Deep Neural Networks (DNNs) (see box 1) and cost functions (see box 2), and what this implies for how object information is stored in these networks.

We start with a discussion of the relevance of DNNs (LeCun et al., 2015; Schmidhuber, 2015) and, following Marblestone et al. (2016), of cost functions for understanding the brain in section 2. We extend our discussion with the importance of optimizing different cost functions simultaneously, presenting a hypothesis on the relationship between relatedness of tasks and the degree of feature representation sharing.

We test this hypothesis in a computational experiment with DNNs in section 3, evaluating how much each network's feature representations contribute to each task. In section 4, we discuss the degree to which we are able to translate our experimental findings to the division between the ventral and dorsal pathway, the multiple functions of the ventral cortex, and the apparent co-occurrence of both distributed and modular representations related to object recognition.

We finish this paper with a discussion of how this framework can be used experimentally to understand the human brain while elaborating on the limitations of DNNs and cost functions. For brevity, we do not consider models of recurrent processing.

2 Multi-task DNNs as models of neural information processing in the brain

Artificial neural networks are inspired by computational principles of biological neuronal networks and are part of a large class of machine learning models that learn feature representations from data by optimizing a cost function. In this section, we discuss why we believe models based on optimizing cost functions, such as DNNs, are relevant for understanding brain function.

2.1 Similarities in architecture and behavior between DNNs and the brain

AlexNet (Krizhevsky et al., 2012), a model that has been used extensively in research relating DNNs to the brain, consists of 7 layers (see box 1). The first layer consists of filters with small kernels that are applied to each position of the input. In the subsequent four layers this procedure is repeated using the output of the preceding layer. This results in an increase in receptive field (RF) size and concurrently an increase in the specificity of tuning (Zeiler & Fergus, 2014). This increase of receptive field size and tuning specificity across the layers resembles the general architecture of feed-forward visual representations in the human brain (Lamme & Roelfsema, 2000; DiCarlo et al., 2012).

A number of BOLD-MRI studies have revealed that the neural activations in early areas of the visual cortex show the best correspondence with the early layers of DNNs and that higher-tier cortical areas show the best correspondence with higher-tier DNN layers (Güçlü & van Gerven, 2015; Eickenberg et al., 2017). MEG/EEG studies have furthermore shown that early layers of DNNs have a peak explained variance that is earlier than higher-tier DNN layers (Cichy et al., 2016; Ramakrishnan et al., 2016). In addition, the DNN model has been shown to predict neural responses in IT, both from humans and macaque, much better than any other computational model (Khaligh-Razavi & Kriegeskorte, 2014; Yamins et al., 2014).

The correspondence between DNNs and the brain raises the question of the degree to which DNNs show ‘behavior’ similar to humans. Early results indicate that humans and DNNs have a similar pattern of performance in terms of the kinds of variation (size, rotation) that make object recognition harder or simpler (Kheradpisheh et al., 2016). It has also been shown that higher-tier layers of DNNs follow human perceptual shape similarity while the lower-tier layers strictly abide by physical similarity (Kubilius et al., 2016). On the other hand, DNNs are, for instance, much more susceptible to the addition of noise to input images than humans (Jang et al., 2017), and the exact degree to which the behavior of DNNs and humans overlap is currently a central topic of research.

Like others (Kriegeskorte, 2015; Yamins & DiCarlo, 2016), we therefore believe that there is a strong case that DNNs can serve as a model for information processing in the brain. From this perspective, using DNNs to understand the human brain and behavior is similar to using an animal model. Like any model, it is a far cry from a perfect reflection of reality, but it is still useful, with unique possibilities to yield insights into the computations underlying cortical function.

2.2 Cost functions as a metric to optimize tasks

While deep neural networks offer the representational power to learn features from data, the actual learning process is guided by an objective that quantifies the performance of the model for each input-output pair. Common practice in machine learning is to express such an objective as a cost function (Domingos, 2012). As Marblestone and colleagues argue, the human brain can be thought of as implementing something very similar to cost functions to quantify the collective performance of neurons and consequently to steer the learning of representations in a direction that improves a global outcome (Marblestone et al., 2016).

2.3 Problem simplification by task decomposition

While humans may act under a grand evolutionary objective of staying alive long enough to reproduce, we accomplish many small-scale objectives along the way, like guiding our arms to our mouth to eat or planning our path through the city. Each of these smaller objectives can be thought of as being governed by its own cost function (see figure 1). These could be embedded in the brain, either hard-coded into the neural substrate by evolution, by sovereign decision making, or as part of meta-learning: learning to learn (Baxter, 1998).

It has been argued that a task becomes easier to solve if it can be decomposed into simpler tasks (Jacobs et al., 1991; Sutton et al., 1999). To support their argument they state that the simple problem of learning the absolute value function can be decomposed into learning two linear functions and a switching function, which leads to a model with fewer parameters that can be trained faster. While such a decomposition could be predefined through the neural substrate, they observe in their experiments that such a decomposition can naturally arise from competitive learning if the same set of parameters is optimized for multiple tasks. As the decomposition of tasks is underdetermined, the learner may come up with a different decomposition each time it is trained.
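The absolute value example can be written out directly. The sketch below hand-codes the two linear functions and the switching function rather than learning them, so it only illustrates the structure of the decomposition; all function names are hypothetical.

```python
def left_expert(x: float) -> float:
    """Linear function that is correct for negative inputs."""
    return -x

def right_expert(x: float) -> float:
    """Linear function that is correct for non-negative inputs."""
    return x

def gate(x: float) -> float:
    """Hard switching function: 1.0 selects the right expert, 0.0 the left one."""
    return 1.0 if x >= 0 else 0.0

def abs_by_decomposition(x: float) -> float:
    """Absolute value as a gated mixture of two linear functions."""
    g = gate(x)
    return g * right_expert(x) + (1.0 - g) * left_expert(x)

for value in (-3.5, -1.0, 0.0, 2.0):
    print(value, abs_by_decomposition(value))
```

In a learned version, the gate and the two linear functions would each have trainable parameters, and the split point would emerge from training instead of being fixed at zero.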

The notion of decomposition has been frequently used in the machine learning literature on reinforcement learning (Dietterich, 2000) to increase learning speed and enable the learning of task-local optima that can be reused to learn a superordinate goal. Very often it is even impossible to specify the objective for a complex task, so that it becomes necessary to decompose it into tractable partial objectives. An example is the objective of vision. Finding an objective for such a broad and vague task appears futile, so it is easier to define a subset of tasks like figure-ground segmentation, saliency and boundaries. A noteworthy implementation of such a decomposition is the recent DNN ‘Uber-Net’ (Kokkinos, 2016), which solves 7 vision-related tasks (boundary, surface normals, saliency, semantic segmentation, semantic boundary and human parts detection) with a single multi-scale DNN to reduce the memory footprint. It can be assumed that such multi-task training improves convergence speed and generalization to unseen data, something that has already been observed in other multi-task setups related to speech processing, vision and maze navigation (Dietterich et al., 1990, 1995; Bilen & Vedaldi, 2016; Mirowski et al., 2016; Caruana, 1998).

3 Functional organization in multi-task DNNs

One hypothesis for the emergence of different functional pathways in the visual system is that learning and development in the cortex are under pressure of multiple cost functions induced by different objectives. It has been argued that the brain can recruit local populations of neurons to assign local cost functions that enable fast updating of these neurons (Marblestone et al., 2016). We explore in this section the ramifications of multiple cost functions acting on the same neurons by translating the problem to instances of multi-task DNNs sharing the same parameters. By observing the contributions each feature representation in a DNN has to each task, we will draw conclusions about the functional separation we observe in the visual cortex in section 4.

3.1 Hypothesis

Given two cost functions that optimize two related tasks, and that both put pressure on the same set of parameters, we conjecture that the parameters learned will be general enough to be used for both tasks (see figure 2B). In contrast, we speculate that, when the tasks are unrelated, two subsets of parameters will emerge during learning that each lie within their task-respective feature domain (see figure 2C). Because the amount of feature representation sharing is determined by the relation between tasks, and ultimately by the statistics of the credit assignments, we predict an upper-to-lower tier gradient of feature representation sharing, with the least sharing in the higher-tier layers.

3.2 Training models for multiple tasks

We test this hypothesis on feature representation sharing with DNNs trained for two tasks simultaneously. We construct two example setups involving a pair of related tasks (which we call RelNN), namely the simultaneous classification of ordinate and subordinate categories of objects in images, and a pair of unrelated tasks (which we call UnrelNN), namely the classification of objects and text labels in images (see figure 3). As the relatedness of tasks is not clearly defined and an open problem (Caruana, 1998; Zhang & Yeung, 2014), the tasks were selected based on the assumption that text recognition in UnrelNN is mostly independent of object recognition, while in contrast ordinate-level classification in RelNN is highly dependent on the feature representations formed for subordinate-level classification.

3.2.1 Training setup

Both setups were implemented by training a version of AlexNet (Krizhevsky et al., 2012) on approximately half a million images from the ImageNet database (Russakovsky et al., 2015) each. The code, data and pretrained models are available at https://github.com/mlosch/FeatureSharing. Both models were trained on an identical set of images consisting of 15 ordinate classes further divided into 234 subordinate classes, each image augmented with an overlay of 3-letter labels from 15 different classes (see figure 3a). The overlays were randomly scaled, colored and positioned while ensuring that the text is contained within the image boundaries. To optimize the models for two tasks simultaneously, the output layer of AlexNet was split into two independent layers (see figure 3b), each with its own softmax activation. For classification performance results see table 1.
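The split output layer can be sketched in plain Python. This is a minimal illustration rather than the authors' PyTorch code: the helper names, the toy dimensions and weights, and the choice to sum the two cross-entropy losses are all assumptions of this sketch.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(features, weights, bias):
    """One fully connected layer; weights holds one row per output unit."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]

def two_head_forward(features, head_a, head_b):
    """Shared features feed two independent output layers, each with its own softmax."""
    probs_a = softmax(linear(features, *head_a))
    probs_b = softmax(linear(features, *head_b))
    return probs_a, probs_b

def joint_loss(probs_a, probs_b, label_a, label_b):
    """Assumption: the two cross-entropy losses are simply summed."""
    return -math.log(probs_a[label_a]) - math.log(probs_b[label_b])

# Toy sizes (hypothetical): 4 shared features, 3 classes for task A, 2 for task B.
shared = [0.5, -1.0, 2.0, 0.0]
head_a = ([[0.1, 0.2, 0.3, 0.4], [0.0, -0.1, 0.2, 0.1], [0.3, 0.3, -0.2, 0.0]],
          [0.0, 0.1, -0.1])
head_b = ([[0.2, 0.0, 0.1, -0.3], [-0.2, 0.1, 0.0, 0.3]], [0.0, 0.0])

probs_a, probs_b = two_head_forward(shared, head_a, head_b)
loss = joint_loss(probs_a, probs_b, label_a=0, label_b=1)
print(probs_a, probs_b, loss)
```

Because both heads read the same shared features, gradients from both losses would flow into the same lower layers, which is what puts the two cost functions in competition for the shared parameters.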

3.2.2 Measuring feature representation contribution

To determine the degree of feature representation sharing in a neural network we measure the contribution each feature representation has to both tasks. Our method is inspired by the attribute contribution decomposition by Robnik-Sikonja & Kononenko (2008), which has recently been used to visualize the inner workings of deep convolutional networks (Zintgraf et al., 2017). The method is used to marginalize out features in the input image in the shape of small image patches, to observe the impact on the classification. In comparison, our method considers feature representations instead of features, as we are not interested in the contribution of particular feature instances. The interested reader is referred to appendix A for the definition and derivation of the task contribution.
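Marginalizing out a feature representation can be illustrated with a toy readout. The sketch below assumes that a unit is marginalized by replacing its activation with values sampled from its empirical distribution and averaging the resulting class probabilities; the toy network, the sample values, and all names are hypothetical, and the exact definition is the one given in appendix A.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def predict(hidden, readout):
    """Class probabilities from hidden-unit activations (toy linear readout)."""
    logits = [sum(w * h for w, h in zip(row, hidden)) for row in readout]
    return softmax(logits)

def marginalize_unit(hidden, unit, sampled_values, readout):
    """Average the prediction over sampled replacement values for one unit."""
    totals = [0.0] * len(readout)
    for value in sampled_values:
        replaced = list(hidden)
        replaced[unit] = value
        for c, p in enumerate(predict(replaced, readout)):
            totals[c] += p
    return [t / len(sampled_values) for t in totals]

hidden = [1.0, 0.5, -0.3]                      # activations for one input
readout = [[1.0, 0.0, 0.5], [-1.0, 0.0, 0.2]]  # unit 1 has no influence here
samples = [0.0, 0.4, 0.9, 1.3]                 # hypothetical empirical values

p_full = predict(hidden, readout)
p_without_0 = marginalize_unit(hidden, 0, samples, readout)
p_without_1 = marginalize_unit(hidden, 1, samples, readout)
print(p_full, p_without_0, p_without_1)
```

Marginalizing unit 1, which has zero readout weights, leaves the prediction unchanged, while marginalizing unit 0 shifts it; the size of that shift is what a contribution measure quantifies.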

3.2.3 Results

We visualize the layer-wise task contributions by unrolling the feature representations of a layer on a rectangle and coloring each resulting cell by the composition of its contribution. Blue is used as indicator for the subordinate-level recognition task and yellow as indicator for the text- and basic-level-recognition task respectively. Equal contribution to both tasks results in grayish to white tones while little contribution to either task causes dark to black tones (see figure 4 for the color coding). A high degree of feature representation sharing would hereby generate cells colored in the range from black and gray to white, while low degree of sharing would result in more pronounced and clearly distinguishable colors of yellow and blue.
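One simple mapping that reproduces this color coding is additive mixing of a blue and a yellow component. This is an assumed mapping for illustration, with contributions normalized to the range 0 to 1; the figure may use a different normalization or color model.

```python
def clip01(v: float) -> float:
    """Clamp a value to the range [0, 1]."""
    return max(0.0, min(1.0, v))

def contribution_color(subordinate: float, other: float):
    """Map two task contributions to an RGB triple.

    Blue encodes the subordinate-level task, yellow (red + green) the
    text- or basic-level task; the two components are added.
    """
    blue = clip01(subordinate)
    yellow = clip01(other)
    return (yellow, yellow, blue)

print(contribution_color(1.0, 1.0))  # equal, high contribution
print(contribution_color(0.0, 0.0))  # no contribution to either task
print(contribution_color(1.0, 0.0))  # subordinate-level task only
print(contribution_color(0.0, 1.0))  # text- or basic-level task only
```

Under this mapping, equal contributions give gray to white, low contributions give dark tones, and a unit serving only one task appears as saturated blue or yellow.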

The two visualizations in figure 4 show a substantial difference in feature representation contribution, as the representations in layers 2 to 5 of the RelNN contribute to both tasks much more equally than the representations of the UnrelNN. This is in line with our expectation depicted in figure 2 and our choice of setups. Contrary to our prediction, the degree of feature representation sharing in layer 1 of the UnrelNN is lower than expected; this can be explained by assuming that text recognition is mostly independent of all features but horizontal and vertical lines. Note also that most of the representations in the fully connected layers in both setups contribute only a little. This might seem counter-intuitive at first sight but is an effect of the abundance of representations coupled with the training scheme involving dropout. Dropout significantly reduces co-dependencies between units (Dahl et al., 2013), resulting in only small changes in classification probability after marginalizing out a single representation.

We also observe that there is a dominance of blue cells expressing low contribution to the text- and basic-level-recognition task but high contribution to the subordinate-level-recognition task. We conjecture that this is because the subordinate-level-recognition task uses a larger fraction of units to distinguish between the 234 subordinate classes.

Comparing the layers of both networks, it becomes evident that there generally is a higher degree of feature representation sharing in the RelNN consistent with the idea that relatedness between tasks and therefore cost functions strongly influences the degree of feature representation sharing across layers. More importantly, these results demonstrate that these types of ideas can be translated, using the right image data-sets and task-labels, into quantifiable predictions on the degree of feature sharing that might be observed in the brain.

4 Implications of models optimized for multiple tasks for understanding the visual system

In section 3 we presented an example in which the degree to which feature representations can be shared in a neural network depended on the relatedness of the tasks they are optimized for. In a neural population under pressure of the optimization for two unrelated tasks and the pressure to optimize the length of neuronal wiring (Chklovskii & Koulakov, 2004), a spatial segregation is likely to occur, resulting in anatomically and functionally separate pathways. In this section we consider to what degree we can understand the organization of the visual system from the perspective of a DNN that has been trained on multiple tasks and discuss three hypotheses derived from the simulations.

4.1 The visual system optimizes two cost functions of unrelated tasks

The early visual cortex has neurons that respond to properties such as orientation, wavelength, contrast, disparity and movement direction that are relevant for a broad range of visual tasks (Wandell, 1995). Moving upwards from early cortex we see a gradual increase in the tuning specificity of neurons, resulting in the dorsal and ventral pathways that have, as has become clear over the last 25 years, unrelated goals (Goodale & Milner, 1992). The dorsal pathway renders the representation of objects invariant to eye-centered transformations in a range of reference frames to allow efficient motor planning and control (Kakei, 1999), while the ventral pathway harbors object-centered, transformation-invariant features (Leibo et al., 2015; Higgins et al., 2016) to allow efficient object recognition.

These observations concur well with the predictions and experimental results we made about feature representation sharing in DNNs. Given that the two tasks, vision for recognition and vision for action, are mostly unrelated we can understand the gradual emergence of functional and anatomical separation between these systems from this perspective.

Nonetheless, we note that the functional units of the pathways beyond the occipital lobe are not entirely separated and cross-talk does exist between these pathways (McIntosh & Schenk, 2009; Farivar, 2009; de Haan & Cowey, 2011; van Polanen & Davare, 2015): a phenomenon we also observed in our experiment in section 3. In the UnrelNN, there are feature representations that contribute to both tasks throughout all layers of the network. Consequently, the brain might trade off contribution and wiring length, so that long wiring to the functional epicentre is tolerated for neurons that contribute little.

As a whole the existence of two pathways guided by two cost functions of unrelated tasks might be seen as an illustration of the efficient decomposition of the overall vision function.

4.2 The visual pathways contain further task decompositions each with their own cost functions

We further generalize our perspective on cost function optimization of the visual system via the general observation made from machine learning that a complex task becomes simpler to solve if it is decomposed into simpler smaller tasks (see section 2.3). Given that the tasks we assign to the visual pathways are rather complex and vague we conjecture that there might be a broad range of cost functions active in the pathway regions to optimally decompose the task of vision resulting in a schematic similar to figure 5.

The ventral and dorsal pathways are each involved in a multitude of tasks serving the overall goals of vision for perception and vision for action. Examples of subordinate tasks for vision for action are localization, distance, relative position, position in egocentric space and motion, and these interact with the goals that are part of vision for action: pointing, grasping, self-termination movements, saccades and smooth pursuit (de Haan & Cowey, 2011). Subordinate tasks for vision for perception include contour integration, processing of surface properties, shape discrimination, surface depth and surface segmentation. These in turn interact with executing the goals that are part of vision for perception: categorization and identification of objects, but also scene understanding (Groen et al., 2017).

Reasoning from this framework we can also understand the existence of multiple ‘processing streams’ within the dual pathways. For instance, within ventral cortex there appears to be a pathway for object recognition and a pathway for scene perception. The object recognition pathway consists of areas like V4, which responds to simple geometric shapes, and the anterior part of inferior temporal cortex (aIT), which is sensitive to complete objects (Kravitz et al., 2013). The scene recognition pathway contains areas such as the occipital place area (OPA), involved in the analysis of local scene elements, and the parahippocampal place area (PPA), which responds to configurations of these elements (Kamps et al., 2016). The tasks of scene and object perception are closely related; scenes consist of objects. However, scene perception involves relating the positions of multiple objects to each other, scene gist and navigability (Groen et al., 2017). From our framework we would predict that an area like OPA is mainly involved in the task of scene perception but has RFs that are also used for object perception, and the opposite pattern for V4. Crucially, we believe this framework can be used to generate quantitative predictions for this amount of sharing.

4.3 Distributed versus modal representations

How information is represented is one of the major questions in cognitive neuroscience. When considering object-based representations, both distributed (Haxby et al., 2001; Avidan & Behrmann, 2009) and module-based representations (Cohen et al., 2000; Kanwisher, 2000; Puce et al., 1995) have been observed.

Module-based representations, and theories stressing their importance, point to the existence of distinct cortical modules specialized for the recognition of particular classes such as words, faces and body parts. These modules encompass different cortical areas and, in the case of the fusiform face area and visual word form area, even similar areas but in different hemispheres (Plaut & Behrmann, 2011). Conversely, distributed theories of object recognition point to the possibility of decoding information from a multitude of classes from the patterns of activity present in a range of cortical regions (Haxby et al., 2001; Avidan & Behrmann, 2009).

If we consider feature representations in the early and intermediate layers of the UnrelNN (figure 4) as a reasonable approximation of representations in early / intermediate visual areas, we note that most units are being shared by both streams. However, some units contribute more to one than the other task and are spatially intermingled at the same time. An external observer, analyzing the activity of these representations under stimulation with pattern analysis would conclude that information from both tasks is present, and conclude that a distributed code is present. If the same observer would investigate the representations at the top of the stream the observer would conclude that there is an area dedicated to the analysis of text and another to the analysis of the subordinate task.

Translated to the visual system, this would mean that distributed representations should be observed in areas such as posterior inferior temporal cortex (pIT), OPA and V4, because these units are activated by multiple tasks but with a different weighting. Conversely, at the top of a pathway or stream the network would show a strongly module-based pattern of activation. In sum, multi-task DNNs provide a framework in which we can potentially understand why both modal and distributed representations are observed experimentally, but they suggest that the patterns of activity should be interpreted as emerging from the network as a whole.

5 Discussion

Following Marblestone et al. (2016), and given the strength of the similarities between DNNs and the visual brain, we hypothesize that cost functions associated with different tasks are a major driving force for the emergence of different pathways.

A central insight from machine learning is that functions become easier to learn when they are decomposed into a set of unrelated subtasks. As a whole, the existence of two pathways guided by two cost functions of unrelated tasks might be seen as an illustration of the efficient decomposition of the overall vision function (Sutton et al., 1999). Observing that DNNs decompose a problem into multiple steps, with the earlier layers related to the input and later layers related to the outputs demanded by the task, we hypothesized that the degree of feature representation sharing between tasks is determined by the relatedness of the tasks, with an upper-to-lower tier gradient.

On this basis, we performed simulations that confirm that units in a DNN show a strong degree of sharing when tasks are strongly related and a separation between units when tasks are unrelated. The degree to which this framework will be useful depends on the degree to which understanding elements of brain function using DNNs is valid, which is discussed in sections 5.1 and 5.2. Subsequently, we will argue that having multiple pathways within a multi-task network might also help to address catastrophic forgetting, the phenomenon that an old task is overwritten by learning a new task (section 5.3). Next, we will discuss the ‘vision for perception’ and ‘vision for action’ framework (section 5.4), and finally we discuss the possibilities of using multi-task DNNs for further understanding the brain and ways in which our current analysis approach can be extended (section 5.5).

5.1 The biological realism of machine learning mechanisms

While there has been much progress in the field of deep learning, it remains an open question whether and how the weights of neurons in the brain are updated under the supervision of cost functions, that is, what the actual learning rules of the brain are.

DNNs are trained using backpropagation, an algorithm believed to lack a basis in biology (Crick, 1989; Stork, 1989). Some of the criticisms include the use in backpropagation of symmetrical weights for the forward inference and backward error propagation phases, the relative paucity of supervised signals and the clear and strong unsupervised basis of much learning. Recent research has shown that symmetrical weights are not a strict requirement (Lillicrap et al., 2016). Roelfsema & van Ooyen (2005) already showed that activation feedback combined with a broadly distributed, dopamine-like error-difference signal can, on average, implement error backpropagation in a reinforcement learning setting. Alternative learning schemes, like Equilibrium Propagation (Scellier & Bengio, 2017), have also been shown to approximate error backpropagation while effectively implementing basic STDP rules.

Alternatively, effective deep neural networks could be learned through a combination of efficient unsupervised discovery of structure and reinforcement learning. Recent work on predictive coding suggests this might indeed be feasible (Whittington & Bogacz, 2017). Still, the learning rules that underpin deep learning in biological systems are very much an open issue.

5.2 Cost functions as the main driver of functional organization

Reviewing the literature on the computational perspective for functional regions in the visual system, we conclude that each region might ultimately be traced back to some cost function that the brain optimizes, and to the interplay or competition for neurons (Jacobs et al., 1991) between that cost function and others, resulting in different degrees of feature representation sharing. The domain-specific regions in the ventral stream, for example, may be caused by a cost function defined to optimize for invariance towards class-specific transformations (Leibo et al., 2015). Of these regions, the Fusiform Face Area could additionally be bootstrapped from a rudimentary objective, hard coded by genetics, to detect the pattern of two dots over a line, which is the basic configuration of a face (McKone et al., 2012; Marblestone et al., 2016). Finally, as we argued in section 4, the functional separation of the ventral and dorsal pathway can be associated with two cost functions as well. We emphasize that the precise implementation of these cost functions is unknown, and note that the tasks “vision for recognition” and “vision for action” are merely summaries of all the subordinate tasks into which these two tasks have been decomposed, as argued in section 2.3 and the cost function box.

5.3 Multiple pathways as a solution for catastrophic forgetting

While joint cost functions can be learned when the quantities needed by the cost functions are all present at the same time, most animals are continually learning, and different aspects of cost functions are present at different times. It is well known that standard neural networks have great difficulty learning a new task without forgetting an old task, so-called catastrophic forgetting. Effectively, when training the network for the new task, the parameters that are important for the old task are changed as well, with negative results. While very low learning rates, in combination with an alternating learning scheme, can mitigate this problem to some degree, this is costly in terms of learning time. For essentially unmixed outputs, like controlling body temperature and optimizing financial welfare, an easy solution is to avoid shared parameters, resulting in separate neural networks, or “streams”. Similarly, various properties, such as visual aspects (depth, figure-ground separation, segmentation), can be derived from a single stream, such as an object recognition stream, with each aspect's substream learned via a separate cost function. For tasks sharing outputs, and thus having overlap over different tasks, evidence increasingly suggests that the brain selectively “protects” synapses from modification by new tasks, effectively “unsharing” these parameters between tasks (Kirkpatrick et al., 2016).
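The idea of protecting parameters that matter for an old task can be illustrated with a quadratic penalty in the style of elastic weight consolidation, the approach of Kirkpatrick et al. (2016). The following is a minimal sketch under stated assumptions, not the procedure used in this paper: the function names, the example values and the `importance` vector (standing in for a per-parameter importance estimate such as the diagonal Fisher information) are all hypothetical.

```python
import numpy as np

def ewc_penalty(theta, theta_old, importance, lam=1.0):
    """Quadratic cost for moving parameters away from their old-task values.

    importance[i] is an assumed per-parameter importance for the old task;
    lam (hypothetical name) scales how strongly the old task is protected.
    """
    theta = np.asarray(theta, dtype=float)
    theta_old = np.asarray(theta_old, dtype=float)
    importance = np.asarray(importance, dtype=float)
    return 0.5 * lam * np.sum(importance * (theta - theta_old) ** 2)

def total_loss(new_task_loss, theta, theta_old, importance, lam=1.0):
    """New-task loss plus the protection penalty for the old task."""
    return new_task_loss + ewc_penalty(theta, theta_old, importance, lam)

# Hypothetical values: parameter 0 is important for the old task,
# parameter 1 is not, parameter 2 is unchanged.
theta_old = np.array([1.0, -2.0, 0.5])
importance = np.array([10.0, 0.0, 1.0])
theta_new = np.array([1.1, 3.0, 0.5])

penalty = ewc_penalty(theta_new, theta_old, importance)
```

In this toy case the large change to parameter 1 costs nothing because its importance is zero, while the small change to parameter 0 is penalized. Parameters with high importance are thus effectively "unshared" with the new task.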

5.4 What and where vs. vision for action and perception

Goodale & Milner (1992) argued that the concept of a ‘what and where’ pathway should be replaced by the idea that there are two pathways with different computational goals, vision for perception and vision for action, summarized as a ‘what’ and ‘how’ pathway. Insights from the last 25 years of research in vision science have shown that the original idea of a what and where pathway lacks explanatory power. It is clear that RFs in inferior temporal cortex are large when objects are presented on a blank background (Gross et al., 1985). However, these become substantially smaller, and thereby implicitly contain positional information, when measured against a natural scene background (Rolls et al., 2003). Interestingly, studies on DNNs have shown that approximate object localization can be inferred from a CNN trained only on classification, although the spatial extent of an object cannot be estimated (Oquab et al., 2015).

With regard to the dorsal pathway, it has been observed that there are cells related to gripping an object that are specific for object classes (Brochier & Umiltà, 2007), showing that this pathway contains categorical information in addition to positional information. These observations are in direct opposition to one of the central assumptions of the ‘what’ and ‘where’ hypothesis, namely a strong separation between identity and location processing. It is now abundantly clear that the move from ‘what’ and ‘where’ pathways to ‘what’ and ‘how’ pathways, and from input to function, fits particularly well with vision as a multi-task DNN.

5.5 Future research

Originally, DNNs were criticised for being “black” boxes, and using DNNs to understand the brain was seen as replacing one black box with another. Recent years have shown a rapid increase in our understanding of what makes a DNN work (LeCun et al., 2015; Simonyan & Zisserman, 2014; Zeiler & Fergus, 2014) and of how to visualize the features (Zintgraf et al., 2017; Zhou et al., 2014; Zeiler & Fergus, 2014) that give DNNs their power.

These developments illustrate that DNNs are rapidly becoming more “gray” boxes, and are therefore a promising avenue into increasing our understanding of the architecture and computations used by the visual system and brain.

We therefore believe it is sensible to investigate to what degree multi-task DNNs, trained using the same input, will allow us to understand the functional organisation of the visual system. Using the analytical framework introduced in section 3, we can generate a fingerprint for each of the layers in a network based on the degree of feature representation sharing. This fingerprint can subsequently be related to the activation patterns evoked by different tasks and observed within different cortical areas. Alternatively, it is possible to compare representational dissimilarity matrices (RDMs; Kriegeskorte et al., 2008) obtained from single-task and multi-task DNNs and determine which better explain RDMs obtained from cortical areas.
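The RDM comparison proposed here can be sketched in a few lines. The example below is illustrative only and rests on several assumptions not specified in the text: dissimilarity is taken as one minus the Pearson correlation between activation patterns, two RDMs are compared by a rank (Spearman-style) correlation of their off-diagonal entries, and the arrays are random stand-ins for measured cortical responses and DNN layer activations. All names are hypothetical.

```python
import numpy as np

def rdm(activations):
    """RDM as 1 - Pearson correlation between patterns (rows = stimuli)."""
    return 1.0 - np.corrcoef(np.asarray(activations, dtype=float))

def upper_triangle(matrix):
    """Off-diagonal upper-triangle entries of a square matrix."""
    i, j = np.triu_indices(matrix.shape[0], k=1)
    return matrix[i, j]

def rank(values):
    """Ranks without tie handling (adequate for continuous dissimilarities)."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(len(values))
    return ranks

def rdm_similarity(rdm_a, rdm_b):
    """Rank correlation between the off-diagonal entries of two RDMs."""
    a = rank(upper_triangle(rdm_a))
    b = rank(upper_triangle(rdm_b))
    return float(np.corrcoef(a, b)[0, 1])

# Random stand-ins: 8 stimuli, 50 "voxels", 30 units per DNN layer.
rng = np.random.default_rng(0)
cortical = rng.normal(size=(8, 50))
single_task_layer = rng.normal(size=(8, 30))
multi_task_layer = rng.normal(size=(8, 30))

cortical_rdm = rdm(cortical)
score_single = rdm_similarity(cortical_rdm, rdm(single_task_layer))
score_multi = rdm_similarity(cortical_rdm, rdm(multi_task_layer))
```

With real data, the network whose layer RDMs yield the higher score for a given cortical area would be the better candidate model of that area. With the random inputs used here the two scores carry no meaning.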

An open question remains how subtasks and their associated cost functions are learned from overall goals or general cost functions, both in machine learning (Lakshminarayanan et al., 2016) and in neuroscience (Marblestone et al., 2016; Botvinick et al., 2009).

Acknowledgements

MML is supported by a grant from the ABC, KR is supported by a grant from COMMIT and EHFdH by an ERC (339374 - FAB4V).

Appendix A Measuring parameter contribution

A.1 Marginalization of parameters

In models that are able to handle the lack of information about a particular representation, such as naïve Bayesian classifiers, the contribution can be measured by marking the representation as unknown. Typically, though, neural networks are not able to handle missing information, and setting the parameters of a representation to zero will still result in interpretable information for subsequent layers. While removing a feature representation and retraining the network would alleviate this issue, quantifying the contribution of thousands of representations this way is generally infeasible. Instead, we make use of the model's classification probabilities given by the softmax output, which allows us to estimate the classification probability in the absence of a representation by marginalizing it out via a standard method from statistics. Marginalization effectively computes the weighted average of the classification probabilities after the representation has been replaced with random values sampled from an appropriate distribution. See equation 1 for the mathematical definition used for our evaluation.

[TABLE]

Here $p(y|x,\Theta)$ denotes the probability of input $x$ belonging to class $y$, and $p(y|x,\Theta_{\setminus\theta})$ the probability if $\theta$ is unknown. Note that a feature representation is represented by its parameters $\theta$, which in a neural network setting classically consist of a weight $w$ and an optional bias $b$. $\Theta$ then denotes the set of all parameters, such that $\theta\in\Theta$. Each classification probability is weighted by the prior probability of the sample $\theta$, which expresses how likely the parameter in question is to take the value $\theta$. We used 100 samples in our experiments to approximate the contribution.

A.1.1 Derivation

Given a parametric model like a DNN that is described by its parameters $\Theta$, we can express the probability of input $x$ belonging to class $y$ as $p(y|x,\Theta)$, where the probabilities are given by the softmax output layer. To measure the contribution of a feature generated by parameter $\theta\in\Theta$, we are interested in what the probability is when $\theta$ is missing or unknown. By assuming that the input is independent of the parameters and that the parameters are independent of each other, such that $p(x,\Theta)=p(x)p(\Theta)$ and $p(\Theta)=p(\Theta_{\setminus\theta})p(\theta)$, and by treating the parameters as random variables, we can marginalize out $\theta$ as follows.

[TABLE]

As the integral over all possible values of $\theta$ is intractable for DNN-like structures, we instead approximate the probability by sampling $\theta$ a finite number of times. We can now express the equation above as a sum over all samples of $\theta$.

[TABLE]
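One way to write out the marginalization and its sampling approximation, under the independence assumptions stated above, is sketched below. This is a hedged reconstruction from the surrounding description rather than the equations as originally typeset; the sample index $s$ and the number of samples $S$ are notation introduced here.

```latex
\begin{align}
p(y \mid x, \Theta_{\setminus\theta})
  &= \int p(y, \theta \mid x, \Theta_{\setminus\theta}) \, d\theta \\
  &= \int p(\theta \mid x, \Theta_{\setminus\theta}) \,
          p(y \mid x, \Theta_{\setminus\theta}, \theta) \, d\theta \\
  &= \int p(\theta) \,
          p(y \mid x, \Theta_{\setminus\theta}, \theta) \, d\theta \\
  &\approx \frac{1}{S} \sum_{s=1}^{S}
          p(y \mid x, \Theta_{\setminus\theta}, \theta_{s}),
  \qquad \theta_{s} \sim p(\theta)
\end{align}
```

The third line uses the assumption that $\theta$ is independent of both $x$ and $\Theta_{\setminus\theta}$. In the last line the prior weighting is applied implicitly by drawing the samples from $p(\theta)$. An alternative way to write the same idea is an explicit weighted sum $\sum_{s} p(\theta_{s})\, p(y \mid x, \Theta_{\setminus\theta}, \theta_{s})$ over a fixed set of candidate values with weights normalized to sum to one.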

To sample $\theta$, we assume that the values are normally distributed with uniform variance and mean centered at the learned weight $w$ and bias $b$:

[TABLE]
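The sampling procedure described in this derivation can be sketched for a toy network as follows. This is a minimal illustration under stated assumptions, not the code used for the experiments: the network is a single ReLU hidden layer with a softmax output, a "feature representation" is taken to be one hidden unit with its incoming weights and bias, the standard deviation `sigma` is an arbitrary choice because the variance is not specified here, and the samples are averaged with equal weight because they are drawn from the assumed prior. All names are hypothetical.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D vector of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def forward(x, W1, b1, W2, b2):
    """Class probabilities p(y | x, Theta) of a one-hidden-layer network."""
    hidden = np.maximum(0.0, x @ W1 + b1)
    return softmax(hidden @ W2 + b2)

def marginalized_probability(x, W1, b1, W2, b2, unit,
                             sigma=1.0, n_samples=100, rng=None):
    """Estimate p(y | x, Theta without theta) for one hidden unit.

    The unit's incoming weights and bias are resampled from a normal
    distribution centered at their learned values (assumed std: sigma),
    and the resulting class probabilities are averaged.
    """
    rng = np.random.default_rng() if rng is None else rng
    total = np.zeros(W2.shape[1])
    for _ in range(n_samples):
        W1_s, b1_s = W1.copy(), b1.copy()
        W1_s[:, unit] = rng.normal(W1[:, unit], sigma)
        b1_s[unit] = rng.normal(b1[unit], sigma)
        total += forward(x, W1_s, b1_s, W2, b2)
    return total / n_samples

# Toy network with random "learned" parameters: 4 inputs, 5 units, 3 classes.
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(4, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 3)), rng.normal(size=3)
x = rng.normal(size=4)

p_full = forward(x, W1, b1, W2, b2)
p_marg = marginalized_probability(x, W1, b1, W2, b2, unit=2,
                                  n_samples=100, rng=rng)
```

The default of 100 samples matches the number reported above. Comparing `p_full` with `p_marg` for each unit gives the raw material for the contribution measure of section A.2.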

A.2 Generalizing contributions from classes to tasks

As proposed by Robnik-Sikonja & Kononenko (2008), we use the weighted evidence ($WE$) to measure the contribution of a parameter towards class probability $p(y|x,\Theta)$ (see equation A.6) instead of taking the difference of probabilities directly. A positive value of $WE_{\theta}(y|x,\Theta)$ indicates that $\theta$ adds evidence for class $y$ for input $x$, a negative value indicates evidence against class $y$, and zero indicates that $\theta$ has no contribution at all. To eventually determine the contribution towards a class independent of the input, we calculate the arithmetic mean of the absolute weighted evidence over more than 500 input samples from the test set (see equation 10).

[TABLE]

We finally measure the contribution to a task $t$ by selecting the contributions $C_{\theta}(y|\Theta)$ that satisfy $y=y_{true}$, that is, those for which the class prediction is correct. By additionally filtering out predictions that the network inferred incorrectly, we increase the certainty that the inputs used to evaluate the contributions lead to a high probability for the correct class and low probabilities everywhere else. We further generalize the contribution of $\theta$ to task $t$ by averaging over the contributions to each class $y_{k}$ within task $t$ (see equation 11).

[TABLE]
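The steps of this section (weighted evidence, averaging over inputs, averaging over the classes of a task) can be sketched as below. The sketch assumes the log-odds form of weighted evidence from Robnik-Sikonja & Kononenko (2008), that is, the difference between the base-2 log odds of the full and the marginalized class probability. Clipping probabilities with `eps` is a simplification introduced here to avoid infinite odds, and all function names and example values are hypothetical.

```python
import numpy as np

def log_odds(p, eps=1e-6):
    """Base-2 log odds of a probability, clipped away from 0 and 1."""
    p = np.clip(np.asarray(p, dtype=float), eps, 1.0 - eps)
    return np.log2(p / (1.0 - p))

def weighted_evidence(p_full, p_marginalized, eps=1e-6):
    """Positive: theta adds evidence for the class; negative: against it."""
    return log_odds(p_full, eps) - log_odds(p_marginalized, eps)

def class_contribution(we_values):
    """Mean absolute weighted evidence over input samples of one class."""
    return float(np.mean(np.abs(we_values)))

def task_contribution(class_contributions):
    """Average contribution over all classes belonging to one task."""
    return float(np.mean(class_contributions))

# Hypothetical probabilities of the true class for two correctly
# classified inputs, with and without the parameter in question.
p_full = np.array([0.8, 0.5])
p_marg = np.array([0.5, 0.5])

we = weighted_evidence(p_full, p_marg)   # roughly [2.0, 0.0]
c_class = class_contribution(we)         # roughly 1.0
c_task = task_contribution([c_class, 3.0])
```

In the first input, removing the parameter drops the odds from 4:1 to 1:1, which corresponds to two bits of evidence. In the second input the parameter makes no difference. The value 3.0 stands in for the contribution to a second class of the same task.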

Bibliography

1. Avidan, G., & Behrmann, M. (2009). Functional MRI reveals compromised neural integrity of the face processing network in congenital prosopagnosia. Curr. Biol., 19(13), 1146–1150.
2. Baxter, J. (1998). Theoretical models of learning to learn. In Learning to learn (pp. 71–94).
3. Bilen, H., & Vedaldi, A. (2016). Integrated perception with recurrent multi-task neural networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, & R. Garnett (Eds.), Advances in Neural Information Processing Systems 29.
4. Botvinick, M. M., Niv, Y., & Barto, A. C. (2009). Hierarchically organized behavior and its neural foundations: a reinforcement learning perspective. Cognition, 113(3), 262–280.
5. Brochier, T., & Umiltà, M. A. (2007). Cortical control of grasp in non-human primates. Curr. Opin. Neurobiol., 17(6), 637–643.
6. Caruana, R. (1998). Multitask learning. In S. Thrun & L. Pratt (Eds.), Learning to learn (pp. 95–133). Springer US.
7. Chklovskii, D. B., & Koulakov, A. A. (2004). Maps in the brain: what can we learn from them? Annu. Rev. Neurosci., 27, 369–392.
8. Cichy, R. M., Khosla, A., Pantazis, D., Torralba, A., & Oliva, A. (2016). Comparison of deep neural networks to spatio-temporal cortical dynamics of human visual object recognition reveals hierarchical correspondence.