Non-uniqueness phenomenon of object representation in modelling IT cortex by deep convolutional neural network (DCNN)

Qiulei Dong; Bo Liu; Zhanyi Hu

arXiv:1906.02487·q-bio.NC·June 7, 2019


TL;DR

This paper identifies a non-uniqueness problem in using deep convolutional neural networks to model neural object representations in primate cortex, highlighting a theoretical limitation of this approach.

Contribution

It uncovers an inherent non-uniqueness issue in DCNN-based modeling of neural object representations, emphasizing the need for caution in practical applications.

Findings

01 Non-uniqueness phenomenon exists in DCNN models

02 Highlights theoretical limitations of DCNN in neural modeling

03 Calls for careful interpretation of DCNN-based neural representations

Abstract

Recently DCNN (Deep Convolutional Neural Network) has been advocated as a general and promising modelling approach for neural object representation in primate inferotemporal cortex. In this work, we show that some inherent non-uniqueness problem exists in the DCNN-based modelling of image object representations. This non-uniqueness phenomenon reveals to some extent the theoretical limitation of this general modelling approach, and invites due attention to be taken in practice.


Table 1: Network configurations (shown in columns). The convolutional layer parameters are denoted as “Conv⟨receptive field size⟩-bn-⟨number of channels⟩”. The fully connected layer parameters are denoted as “Fc-⟨number of units⟩”. The input to every network is a 32×32 RGB image.

ConvNet Configuration

| D1 | D2 | D3 | D4 | D5 | D6 |
| --- | --- | --- | --- | --- | --- |
| 5 layers | 8 layers | 8 layers | 8 layers | 15 layers | 9 layers |
| Conv5-32 | Conv3-bn-32 | Conv3-bn-64 | Conv3-bn-128 | Conv3-bn-32 | Conv3-bn-64 |
|  | Conv3-bn-32 | Conv3-bn-64 | Conv3-bn-128 | Conv3-bn-32 |  |
|  |  |  |  | Conv3-bn-32 |  |
|  |  |  |  | Conv3-bn-32 |  |
| Max-pool | Max-pool | Max-pool | Max-pool | Max-pool | Max-pool |
| Conv5-32 | Conv3-bn-64 | Conv3-bn-128 | Conv3-bn-256 | Conv3-bn-64 | Conv3-bn-128 |
|  | Conv3-bn-64 | Conv3-bn-128 | Conv3-bn-256 | Conv3-bn-64 |  |
|  |  |  |  | Conv3-bn-64 |  |
|  |  |  |  | Conv3-bn-64 |  |
| Max-pool | Max-pool | Max-pool | Max-pool | Max-pool | Max-pool |
| Conv5-64 | Conv3-bn-128 | Conv3-bn-256 | Conv3-bn-512 | Conv3-bn-128 | Conv3-bn-256 |
|  | Conv3-bn-128 | Conv3-bn-256 | Conv3-bn-512 | Conv3-bn-128 | Conv3-bn-256 |
|  |  |  |  | Conv3-bn-128 |  |
|  |  |  |  | Conv3-bn-128 |  |
| Max-pool | Max-pool | Max-pool | Max-pool | Max-pool | Max-pool |
| Fc-64 | Conv3-bn-256 | Conv3-bn-512 | Conv3-bn-1024 | Conv3-bn-256 | Conv3-bn-512 |
|  |  |  |  | Conv3-bn-256 | Conv3-bn-512 |
|  | Max-pool | Max-pool | Max-pool | Max-pool | Max-pool |
|  |  |  |  |  | Conv3-bn-512 |
|  |  |  |  |  | Conv3-bn-512 |
|  |  |  |  |  | Max-pool |
| Fc-10 | Fc-10 | Fc-10(100) | Fc-100 | Fc-10 | Fc-10(100) |

Equations

$$C_{x}=\left(\frac{e^{x_{1}}}{\sum_{i=1}^{N}e^{x_{i}}},\frac{e^{x_{2}}}{\sum_{i=1}^{N}e^{x_{i}}},\cdots,\frac{e^{x_{N}}}{\sum_{i=1}^{N}e^{x_{i}}}\right)^{T}\tag{1}$$

$$C_{y}=\left(\frac{e^{y_{1}}}{\sum_{i=1}^{N}e^{y_{i}}},\frac{e^{y_{2}}}{\sum_{i=1}^{N}e^{y_{i}}},\cdots,\frac{e^{y_{N}}}{\sum_{i=1}^{N}e^{y_{i}}}\right)^{T}\tag{2}$$

$$y=(W_{2}^{T}W_{2})^{+}W_{2}^{T}(y^{\prime}-b_{2})=(W_{2}^{T}W_{2})^{+}W_{2}^{T}\left(F(W_{1}S_{1}(I)+b_{1})-b_{2}\right)\tag{3}$$


Qiulei Dong$^{1,2,3}$, Bo Liu$^{1,2}$, Zhanyi Hu$^{1,2,3,*}$

$^{1}$ National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100190, China.

$^{2}$ University of Chinese Academy of Sciences, Beijing 100049, China.

$^{3}$ CAS Center for Excellence in Brain Science and Intelligence Technology, Beijing 100190, China.

∗ Corresponding author: [email protected]


Author summary

In the field of neuroscience, DCNN has been advocated recently as a general and promising modelling approach for neural object representation in primate inferotemporal cortex. However, the following uniqueness problem on the fundamental premise of this modelling approach is still unclear: does there exist a unique representation in the penultimate layer of a DCNN for a given set of image stimuli by only optimizing the object categorization performance? This problem has a great influence on the theoretical foundation and generality of the DCNN-based modelling approach. In this work, we provided a theoretical analysis on this problem as well as some supporting experimental results, and showed that there exists a non-uniqueness phenomenon of object representation under the DCNN-based modelling approach. Hence, we suggest that when DCNNs are used for modeling sensory cortex as a general framework, it is necessary for people to be aware of this potential and inherent non-uniqueness problem, and appropriate network architectures in DCNN learning should be carefully considered.

1 Introduction

Object recognition is a fundamental task of a biological vision system. It is widely believed that the primate inferotemporal (IT) cortex is the final neural site for visual object representation. Due to viewpoint change, illumination variation and other factors, how visual objects are represented in IT cortex, which manifests sufficient invariance to such identity-orthogonal factors, is still largely an open issue in neuroscience.

There are many different natural and manmade object categories, and each category in turn contains various different members. Neuroscientists generally believe that “the computational goal of object representation is likely the same across all of IT cortex” [1], although special cortical areas do exist for faces, body parts, buildings, etc. Currently, a number of works in neuroscience advocate the DCNN (Deep Convolutional Neural Network) as a new framework for modelling vision and brain information processing [2, 3]. In [4, 5], DCNN is regarded as a promising general modelling approach for understanding sensory cortex, called “the goal-driven approach”.

The basic idea of the goal-driven approach for IT cortex modelling can be summarized as follows: a multi-layered DCNN is trained by ONLY optimizing the object categorization performance with a large set of visual category-labeled objects. Once a high categorization performance is achieved, the outputs of the penultimate layer neurons of the trained DCNN, which are regarded as the object representation, can reliably predict the IT neuron spikes for other visual stimuli in rapid object recognition. (Footnote 1: The goal-driven approach is for modelling IT neuron representation in rapid object vision, which is assumed to be largely a feed-forward process and hence could be modelled by DCNNs, which are also feed-forward networks.) In addition, the outputs of the upstream layer neurons can also predict the V4 neuron spikes. The goal-driven approach is conceptually elegant and has been successfully used to model IT cortex in rapid object recognition and predict category-orthogonal properties [6].

2 Does the goal-driven approach satisfy the uniqueness requirement in modelling IT cortex?

2.1 Motivation

Although some experimental results have demonstrated the success of the goal-driven approach in modelling IT cortex to some extent as mentioned above, the following uniqueness problem on the fundamental premise of the goal-driven approach is still unclear: does there exist a unique pattern of activations of the neurons (units) in the penultimate layer of a DCNN to a given set of image stimuli by only optimizing the object categorization performance? This uniqueness problem on object representation via a DCNN has a great influence on the theoretical foundation and generality of the goal-driven approach in particular, and the DCNN as a new framework for vision modelling in general.

In this work, we aim to provide a theoretical analysis of this problem as well as some supporting experimental results. In order to analyse this problem more clearly, we first introduce the definition of a DCNN layer’s object representation as used for predicting the neuron responses of primate IT cortex in the aforementioned goal-driven approach:

Definition 1.

For a layer of a DCNN for object recognition, the activations of the neurons in this layer to an input object image are defined as its object representation.

Following the convention in computational neuroscience, the following representation equivalence is introduced to evaluate whether the object representations learnt from two DCNNs are the same or not:

Definition 2.

Given a set of object image stimuli, if the two object representations of two DCNNs on these stimuli can be related by a linear transformation, they are considered equivalent, or the same representations. Otherwise, they are different representations.

In the deep learning community, a recent active research topic is called “convergent learning” [7], referring to whether different DCNNs can learn the same representation at the level of neurons or groups of neurons. A generally reached conclusion is that different DCNNs with the same network architecture, trained only with different random initializations, have largely different representations at the level of neurons or groups of neurons, although their image categorization performances are similar. Note that although Li et al.’s work and the goal-driven approach focus on the representation from different points of view, the representations in the two works are closely related. Hence, the results in [7] could also re-highlight the aforementioned uniqueness problem in object representation via a DCNN to some extent.

Addressing this uniqueness problem, we show in the following section that, in theory, by only optimizing the image categorization accuracy, different DCNNs can give different object representations though they have exactly the same categorization accuracy. In other words, the obtained object representations by DCNNs under the goal-driven approach could be inherently non-unique, at least in theory.

2.2 Theoretical analysis and experimental results

Proposition 1.

If the ‘Softmax’ function is used as the final classifier for image categorization in modelling $N$ categories of objects via a DCNN, and the object category with the largest probability is chosen as the final categorization, and if $x=(x_{1},x_{2},\cdots,x_{N})^{T}\in R^{N}$ is the final output of this DCNN for an input image object $I$ , $f(\cdot)$ is a univariate nonlinear monotonically increasing function, $y\triangleq(y_{1},y_{2},\cdots,y_{N})^{T}=F(x)=(f(x_{1}),f(x_{2}),\cdots,f(x_{N}))^{T}$ , then $x$ and $y$ give exactly the same categorization result.

Proof: For $x$ and $y$ , their corresponding probability vectors by Softmax are respectively:

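$$C_{x}=\left(\frac{e^{x_{1}}}{\sum_{i=1}^{N}e^{x_{i}}},\frac{e^{x_{2}}}{\sum_{i=1}^{N}e^{x_{i}}},\cdots,\frac{e^{x_{N}}}{\sum_{i=1}^{N}e^{x_{i}}}\right)^{T}\tag{1}$$

$$C_{y}=\left(\frac{e^{y_{1}}}{\sum_{i=1}^{N}e^{y_{i}}},\frac{e^{y_{2}}}{\sum_{i=1}^{N}e^{y_{i}}},\cdots,\frac{e^{y_{N}}}{\sum_{i=1}^{N}e^{y_{i}}}\right)^{T}\tag{2}$$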

Since $y_{i}=f(x_{i})$ ( $i=1,2,\cdots,N$ ) and $f(\cdot)$ is a monotonically increasing function, the elements of $y$ have the same magnitude order as the elements of $x$. Because the exponential function is increasing and all elements of a Softmax vector share the same denominator, $C_{x}$ has the same magnitude order as $x$, and $C_{y}$ has the same magnitude order as $y$. Hence the magnitude order of the two probability vectors $C_{x}$ and $C_{y}$ is also the same. Since the object category with the largest probability is chosen as the final categorization, the indices of the largest elements in $C_{x}$ and $C_{y}$ are the same, and the same categorization results are obtained for $x$ and $y$. $\blacksquare$
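Proposition 1 can be checked numerically with a minimal sketch. The function `f` below (a cubic plus a linear term) and the randomly drawn output vectors are illustrative assumptions, not the setting used in the paper; any strictly increasing nonlinear function would serve.

```python
import numpy as np

def softmax(v):
    """Numerically stable Softmax of a 1-D vector."""
    e = np.exp(v - np.max(v))
    return e / e.sum()

def f(v):
    """Illustrative nonlinear, strictly increasing map (an assumption)."""
    return v ** 3 + v

rng = np.random.default_rng(0)
x = rng.normal(size=(1000, 10))   # 1000 hypothetical output vectors, N = 10
y = f(x)                          # element-wise mapping y_i = f(x_i)

# Category chosen as the index of the largest Softmax probability.
labels_x = np.array([np.argmax(softmax(row)) for row in x])
labels_y = np.array([np.argmax(softmax(row)) for row in y])

same_labels = bool(np.all(labels_x == labels_y))
print(same_labels)
```

The two label arrays agree on every sample even though `y` is a nonlinear function of `x`.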

Remark 1: Since $f(\cdot)$ is a nonlinear function, $x$ and $y$ cannot be related by a linear transformation. In addition, in the deep learning community, the Softmax function is commonly used to convert the output vector of the network into a probability vector, and the category with the largest probability value is chosen as the final category.

Remark 2: In theory, $f(\cdot)$ could be different for different input images $I$. More generally, even the monotonicity of $f(\cdot)$ is unnecessary: we need only that the index of the largest value in $y$ is the same as that in $x$, because only the largest value determines the categorization. For the Top-$K$ categorization accuracy, we need only that the index set of the $K$ largest values in $y$ is the same as that in $x$; no requirement is placed on the remaining elements. Hereinafter, for notational convenience and practicality of implementation, we always assume that $f(\cdot)$ is a univariate nonlinear monotonically increasing function.
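The weaker Top-$K$ condition of Remark 2 can be illustrated with a short sketch. The particular non-monotone map used here (keep the $K$ largest entries, overwrite all others with arbitrary smaller values) is a hypothetical example chosen for illustration only.

```python
import numpy as np

def top_k_set(v, k):
    """Index set of the k largest entries of a 1-D vector."""
    return set(np.argsort(v)[-k:].tolist())

rng = np.random.default_rng(1)
x = rng.normal(size=10)   # hypothetical output vector, N = 10
k = 3

# Hypothetical non-monotone map: keep the k largest entries unchanged and
# overwrite every other entry with an arbitrary value below the k-th largest.
order = np.argsort(x)
kth_largest = x[order[-k]]
y = x.copy()
y[order[:-k]] = kth_largest - 1.0 - rng.uniform(size=x.size - k)

same_top_k = top_k_set(x, k) == top_k_set(y, k)
rest_changed = not np.allclose(x, y)
print(same_top_k, rest_changed)
```

The Top-$K$ index sets coincide although the remaining entries of `y` bear no monotone relation to those of `x`.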

Proposition 2.

As shown in Figure 1, assume that DCNN1 is a multi-layered network, concatenating a sub-network DCNN ${}^{P}_{1}$ whose output is $x$ , and a fully connected layer with weight matrix $W_{1}\in R^{N\times M}$ and bias $b_{1}\in R^{N\times 1}$ ( $\{M,N\}$ are the numbers of neurons at the penultimate layer and last layer of DCNN1 respectively, with $M>N$ ), with $x^{\prime}=W_{1}x+b_{1}$ . And assume that DCNN2 is a multi-layered network, concatenating a sub-network DCNN ${}^{P}_{2}$ whose output is $y$ , and a fully connected layer with weight matrix $W_{2}\in R^{N\times M}$ and bias $b_{2}\in R^{N\times 1}$ , with $y^{\prime}=W_{2}y+b_{2}$ . If $y^{\prime}=f(x^{\prime})$ in element-wise mapping where $f(\cdot)$ is a monotonically increasing function, then the object representation $x$ under DCNN1 cannot be related by a linear transformation to the object representation $y$ under DCNN2, or $x$ and $y$ are two different object representations under the goal-driven approach.

Proof: Since $y^{\prime}=f(x^{\prime})$ in element-wise mapping where $f(\cdot)$ is a monotonically increasing function, according to Proposition 1, DCNN1 and DCNN2 have the identical image object categorization performance.

Since $x^{\prime}=W_{1}x+b_{1}$, we have $x=(W_{1}^{T}W_{1})^{+}W_{1}^{T}(x^{\prime}-b_{1})$, where $A^{+}$ denotes the pseudo-inverse of matrix $A$. Similarly, $y=(W_{2}^{T}W_{2})^{+}W_{2}^{T}(y^{\prime}-b_{2})$. By Proposition 1, $x^{\prime}$ and $y^{\prime}$ are related by a nonlinear function, so $x$ and $y$ cannot be related by a linear transformation either. In other words, $x$ and $y$ are two different object representations under the goal-driven approach. $\blacksquare$

Remark 3: Since $\{W_{1},W_{2}\}\in R^{N\times M}$ and $M>N$ in Proposition 2, the pseudo-inverse operator is used in the above proof. Here are a few words on the pseudo-inverse: Since $M>N$, which is the usual case in most existing DCNNs for object categorization [8, 9, 10], the inverse $(W_{i}^{T}W_{i})^{+}$ ( $i=1,2$ ) is not unique, but the equalities in $x=(W_{1}^{T}W_{1})^{+}W_{1}^{T}(x^{\prime}-b_{1})$ and $y=(W_{2}^{T}W_{2})^{+}W_{2}^{T}(y^{\prime}-b_{2})$ can be strictly met.
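The role of the pseudo-inverse when $M>N$ can be explored with a small numerical sketch. The dimensions and the random weights below are assumptions for illustration. The sketch checks that the vector given by the pseudo-inverse expression reproduces $x^{\prime}$ exactly through the final layer; because $M>N$, many penultimate vectors map to the same $x^{\prime}$, and the expression returns the minimum-norm one, which need not coincide with an arbitrary vector that produced $x^{\prime}$.

```python
import numpy as np

rng = np.random.default_rng(2)
N, M = 10, 64                      # N categories, M penultimate units, M > N
W = rng.normal(size=(N, M))        # hypothetical final-layer weights
b = rng.normal(size=N)             # hypothetical final-layer bias
x = rng.normal(size=M)             # hypothetical penultimate activations
x_out = W @ x + b                  # x' = W x + b

# Expression used in the proof: (W^T W)^+ W^T (x' - b).
# The second argument is an explicit cutoff for treating singular values as zero.
x_hat = np.linalg.pinv(W.T @ W, 1e-10) @ W.T @ (x_out - b)

reproduces_output = bool(np.allclose(W @ x_hat + b, x_out))
equals_original = bool(np.allclose(x_hat, x))
matches_pinv_W = bool(np.allclose(x_hat, np.linalg.pinv(W) @ (x_out - b)))
print(reproduces_output, equals_original, matches_pinv_W)
```

In the Appendix procedure the representation $y$ is defined through this expression, so for $y$ the equality holds by construction.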

Proposition 2 indicates that given DCNN1 with output $x^{\prime}$ , if there exists another multi-layered network DCNN2 to output $y^{\prime}=f(x^{\prime})$ , their representations $x$ and $y$ would be different but with identical categorization performance. This means that the aforementioned non-uniqueness problem in object representation modelling under the goal-driven approach would arise regardless of how many training images are used, and how many exemplar images in each category are included. In other words, the non-uniqueness problem is an inherent problem in DCNN modelling under the goal-driven approach, and it cannot be completely removed by using more training data, at least in theory.

In the above, an implicit assumption is that given a DCNN1 with the output $x_{i}^{\prime}$, there always exists a DCNN2 with the output $y_{i}^{\prime}=f(x_{i}^{\prime})$. Does such a DCNN2 really always exist? This issue can be addressed separately for the following two cases. The first is that DCNN1 and DCNN2 could be of different architectures, and the second is that they are of the same architecture, but merely initialized differently during training.

The different architecture case

Proposition 3.

There always exists a multi-layered network to map $I_{i}$ to $y_{i}$ for the given input-output pairs $\{(I_{i}\leftrightarrow y_{i}),i=1,2,\cdots,n\}$ in Proposition 2.

Proof: As shown in Proposition 2 and Figure 1, since DCNN1 exists, it maps $I$ to $x$. Denote this mapping function as $x=S_{1}(I)=DCNN^{P}_{1}(I)$. Since $x^{\prime}=W_{1}x+b_{1}$, $y^{\prime}=F(x^{\prime})=(f(x_{1}^{\prime}),f(x_{2}^{\prime}),\cdots,f(x_{N}^{\prime}))^{T}$, $y^{\prime}=W_{2}y+b_{2}$, and $y=(W_{2}^{T}W_{2})^{+}W_{2}^{T}(y^{\prime}-b_{2})$, we have:

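$$y=(W_{2}^{T}W_{2})^{+}W_{2}^{T}(y^{\prime}-b_{2})=(W_{2}^{T}W_{2})^{+}W_{2}^{T}\left(F(W_{1}S_{1}(I)+b_{1})-b_{2}\right)\tag{3}$$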

This is just the required mapping function. By the Universal Approximation Theorem in [11], there always exists a DCNN, denoted as DCNN2, whose sub-network DCNN ${}^{P}_{2}$ is able to approximate this function. $\blacksquare$

Proposition 3 indicates that given a DCNN1, there always exists a DCNN2 whose architecture may be different from DCNN1, so that the object representations of the two DCNNs are different but with the same categorization performance. A training procedure is described in the Appendix, to show how to train such a pair of DCNN1 and DCNN2.

Remark 4: In the proof, the only requirement for DCNN2 is that it should have sufficient capacity to represent the input object set, but it does not necessarily have a similar network architecture to DCNN1. Note that the sufficient representational capacity is an implicit necessary requirement for any DCNN-based applications.

Remark 5: In the proof, the number of input images is assumed to be unknown. For the finite-input case, however, Theorem 1 in [12] guarantees that there exists a two-layered neural network with ReLU activation and $2n+d$ weights that can represent any mapping from input to output on a sample of size $n$ in $d$ dimensions. Of course, such a constructed network may simply memorize the samples: it ensures that the given finite inputs are mapped to the required outputs, but it does not guarantee sufficient generalization ability for new samples.
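To make the finite-sample statement concrete, the sketch below builds one known kind of memorizing network with $2n+d$ parameters for scalar targets: a random projection $a\in R^{d}$, $n$ interleaved biases, and $n$ output weights obtained from a lower-triangular linear system. It is a hedged illustration under these assumptions and is not claimed to reproduce the construction of [12] verbatim; the helper names `fit_memorizing_relu_net` and `predict` are hypothetical.

```python
import numpy as np

def fit_memorizing_relu_net(X, t, rng):
    """Return (a, b, w) such that sum_j w_j * relu(a.x_i - b_j) = t_i for all i."""
    n, d = X.shape
    a = rng.normal(size=d)                 # random projection; z_i distinct almost surely
    z = X @ a
    order = np.argsort(z)
    z_sorted = z[order]
    # Interleave the biases: b_1 < z_(1) < b_2 < z_(2) < ... < b_n < z_(n).
    b = np.empty(n)
    b[0] = z_sorted[0] - 1.0
    b[1:] = 0.5 * (z_sorted[:-1] + z_sorted[1:])
    # A[i, j] = relu(z_(i) - b_j) is lower triangular with a positive diagonal.
    A = np.maximum(z_sorted[:, None] - b[None, :], 0.0)
    w = np.linalg.solve(A, t[order])
    return a, b, w

def predict(X, a, b, w):
    """Two-layer ReLU network: one hidden layer of n units, linear read-out."""
    hidden = np.maximum((X @ a)[:, None] - b[None, :], 0.0)
    return hidden @ w

rng = np.random.default_rng(3)
n, d = 20, 5
X = rng.normal(size=(n, d))                # hypothetical inputs
t = rng.normal(size=n)                     # arbitrary scalar targets
a, b, w = fit_memorizing_relu_net(X, t, rng)

fits_all_samples = bool(np.allclose(predict(X, a, b, w), t, atol=1e-6))
num_parameters = a.size + b.size + w.size  # equals 2n + d
print(fits_all_samples, num_parameters)
```

The network fits the $n$ given samples exactly, but nothing in the construction controls its outputs on new inputs, which is the memorization caveat stated in the remark.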

The same architecture case

When DCNN1 and DCNN2 are obtained with the same network architecture but only trained under different random initializations, clearly a theoretical proof is impossible. However, based on the reported results in the “convergent learning” literature as well as our simulated experimental results, it seems they still largely have non-equivalent object representations although they have similar categorization performances.

(1) Non-uniqueness results from the “convergent learning” literature

Using AlexNet [8] as a benchmark, Li et al. [7] showed that when the architecture is kept unchanged and the networks are only trained with different random initializations, the four DCNNs obtained have similar categorization performances, but their object representations are largely different in terms of one-to-one, one-to-many, and many-to-many linear representation mapping. Note that the many-to-many mapping in [7] is closely related to the representation equivalence in Definition 2. Hence, the four representations are largely non-equivalent, and this non-equivalence becomes more prevalent with increasing convolutional layers.

By introducing the concepts of “ $\epsilon$ -simple match set” and “ $\epsilon$ -maximum match set”, Wang et al. [13] showed that for two representative DCNNs, VGG [9] and ResNet [14], the size of the maximum match set between the activation vectors of individual neurons at the same layer of two DCNNs, which are also obtained with only different initializations as done in [7], is tiny compared with the number of neurons at that layer. It was further found that only the outputs of neurons in the $\epsilon$ -maximum match set can be approximated within an $\epsilon$ -error bound by a linear transformation, which indicates that for the majority of the neurons at the same layer, their outputs cannot be reasonably approximated by a linear transformation, or the corresponding object representations are largely not equivalent.

(2) Non-uniqueness results from our experiments

Definition 3.

If two DCNNs, DCNN1 and DCNN2, have similar image categorization performances with the same network architecture but different parameter configurations, they are called a similar performing pair of DCNNs.

Generally speaking, our results further confirm the non-uniqueness phenomenon of object representation under the goal-driven approach. We systematically investigated the representation differences between a similar performing pair of DCNNs on the two public object image datasets, CIFAR-10 that contains 60,000 images belonging to 10 categories of objects and CIFAR-100 that contains 60,000 images belonging to 100 categories of objects [15]. In our experiments, 5,000 images per category in CIFAR-10 (also 500 images per category in CIFAR-100) were randomly selected for network training, and the rest for testing. Six network architectures with different configurations (denoted as $\{$ D1, D2, D3, D4, D5, D6 $\}$ ) were employed for evaluations, where $\{$ D1, D2, D3, D5, D6 $\}$ were for CIFAR-10 and $\{$ D3, D4, D6 $\}$ were for CIFAR-100 as shown in Table 1.

The traditionally used measure, “explained variance” (EV), was employed to assess the degree of linearity between the learnt object representations from a similar performing pair of DCNNs, and we trained similar performing pairs of DCNNs under the following two schemes:

Scheme-1

Both DCNN1 and DCNN2 were trained with random initializations.

Scheme-2

Similar to the training procedure in the Appendix, DCNN1 was first trained with the Softmax loss, and then DCNN2 was trained by combining the Softmax loss on the neuron outputs of the last layer and the Euclidean loss on the differences between the neuron outputs of the penultimate layer in DCNN2 and the corresponding terms calculated according to Eq. (3) (in our experiments, $f(x)=|x|\sqrt{x}$ ).
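The text does not spell out how the explained variance is computed, so the following is only a minimal sketch under stated assumptions: one representation is regressed on the other by ordinary least squares with an intercept, the EV of each target unit is one minus the ratio of residual variance to total variance, and the mean is taken over units. The fit is in-sample (no cross-validation), the function name `mean_explained_variance` is hypothetical, and the synthetic representations are stand-ins for real DCNN activations.

```python
import numpy as np

def mean_explained_variance(X, Y):
    """Mean over units of Y of the variance explained by a linear map of X."""
    X1 = np.hstack([X, np.ones((X.shape[0], 1))])   # intercept column (an assumption)
    coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)   # least-squares linear fit
    residual = Y - X1 @ coef
    ev_per_unit = 1.0 - residual.var(axis=0) / Y.var(axis=0)
    return float(ev_per_unit.mean())

rng = np.random.default_rng(4)
X = rng.normal(size=(2000, 16))          # stand-in representation of DCNN1 (stimuli x units)
A = rng.normal(size=(16, 16))
U = X @ A
Y_linear = U                             # exactly linearly related to X
Y_nonlinear = np.abs(U) * U              # element-wise nonlinear distortion of U

ev_linear = mean_explained_variance(X, Y_linear)
ev_nonlinear = mean_explained_variance(X, Y_nonlinear)
print(round(ev_linear, 3), round(ev_nonlinear, 3))
```

In this toy setting the nonlinearly distorted representation still yields a fairly large EV, which illustrates that a high EV alone does not establish equivalence in the sense of Definition 2.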

Here are some main results from our experiments:

(i) Explained variance on standard data

The results using the training Scheme-1 are shown in Figure 2, which reports the categorization accuracies of similar performing pairs of DCNNs under different network architectures with two random initializations on CIFAR-10 and CIFAR-100. The blue bars of Figure 2 show the corresponding mean EVs on CIFAR-10 and CIFAR-100 respectively. As seen from Figure 2, the mean EVs by $\{$ D1, D2, D3, D5, D6 $\}$ are around $63.4\%\sim 87.5\%$ on CIFAR-10, while the mean EVs by $\{$ D3, D4, D6 $\}$ are around $53.6\%\sim 65.9\%$ on CIFAR-100. In addition, the mean EV of the network D1 under the training Scheme-2 is $51.2\%$ on CIFAR-10.

Two points are revealed from these results:

• Given a similar performing pair of DCNNs, although the representations of the two DCNNs cannot in theory be related by a linear transformation, the explained variance between the two representations is relatively large.

• A similar performing pair of DCNNs with a deeper architecture, or having more layers, will generally have a larger explained variance between the two representations. The underlying reason seems to be that, since a DCNN with a deeper architecture will generally have a larger representational capacity and since a fixed task has a fixed representation demand, a DCNN with a larger capacity will give a more linear representation.

In addition, for a similar performing pair, although their categorization performances are similar, it does not mean that the two DCNNs assign the identical categorization label to each input sample, either correct or wrong. We have manually checked the categorization results for CIFAR-10 and CIFAR-100. The orange bars of Figure 2 show the computed mean EVs for only those inputs that are correctly categorized. As seen from Figure 2, the discrepancy between the explained variances for only the correctly categorized inputs and those for all inputs is negligible in most cases, perhaps because the categorization rate of the two DCNNs is already high, so that the incorrectly categorized inputs make up only a small fraction of a relatively large test set.

(ii) Explained variance on noisy data

In [16], it is reported that DCNNs are sometimes sensitive to adversarial images, that is, images slightly corrupted with random noise, which do not pose any significant problem for human perception, but dramatically alter the categorization performance of DCNNs. Here, we assessed the noise effects on the representation equivalence on CIFAR-10. The input images are normalized to the range $[0,1]$, and Gaussian noise with mean 0 and standard deviation $\sigma=\{0.01,0.02,0.03,0.04,0.05,0.07,0.1\}$ is added to these images respectively. Figure 3 shows the corresponding categorization accuracies of similar performing pairs of DCNNs under different architectures, together with the corresponding mean EVs. We find that even under the noise level $\sigma=0.1$, the explained variance does not change much, although the categorization accuracy decreases notably.
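The noise corruption described above can be sketched as follows. The random array stands in for CIFAR-10 images scaled to $[0,1]$, and the choice not to clip the noisy values back to $[0,1]$ is an assumption, since the text does not say whether clipping was applied; the helper name `add_gaussian_noise` is hypothetical.

```python
import numpy as np

def add_gaussian_noise(images, sigma, rng):
    """Add zero-mean Gaussian noise with standard deviation sigma (no clipping)."""
    return images + rng.normal(loc=0.0, scale=sigma, size=images.shape)

rng = np.random.default_rng(5)
images = rng.uniform(size=(100, 32, 32, 3))   # stand-in for images scaled to [0, 1]
sigmas = [0.01, 0.02, 0.03, 0.04, 0.05, 0.07, 0.1]

# One noisy copy of the image set per noise level.
noisy_sets = {s: add_gaussian_noise(images, s, rng) for s in sigmas}

# Sanity check: the empirical noise level should be close to the nominal one.
empirical_std = {s: float((noisy_sets[s] - images).std()) for s in sigmas}
print(empirical_std)
```

Each noisy set would then be passed through both networks of a similar performing pair to obtain the accuracies and the representations whose EV is compared.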

(iii) Variations of explained variance by changing stimuli size

In neuroscience experiments, the number of stimuli cannot be too large. However, for image categorization by DCNNs, the size of the test set could be very large. Does the size of the stimuli set play a role in the explained variance? To address this issue, we assessed the explained variance as the dataset size increases by resampling subsets from the original test set of images in CIFAR-10. Here, image subset sizes of $[1000,2000,\cdots,10000]$ are evaluated. Figure 4 shows the results on the resampled subsets from the whole set of test data and from the set of only those images which are correctly categorized. Our results show that once the size of the stimuli set reaches a modestly large number (around $3000$ ), the explained variance stabilizes. That is to say, a very large number of stimuli is not needed for reliably estimating the explained variance: stimuli in the order of thousands already reveal the essence, and a further increase in the number of stimuli does not alter the estimate much.

(iv) Explained variance vs neuron selectivity

Clearly, some DCNN neurons are more selective than others [17, 18]. Using the kurtosis [19] of the neuron’s response distribution to image stimuli, we investigated whether neuron selectivity has some correlation with the explained variance. We chose the top $\{10\%,20\%,\cdots,100\%\}$ most selective neurons from each DCNN in a similar performing pair respectively, then computed the explained variance between the two chosen subsets, and the results are shown in Figure 5. As seen from Figure 5, with the increase of the percentage of selective neurons, the explained variance increases accordingly. This indicates that for the object representations of a similar performing pair of DCNNs, neuron selectivity is also an influential factor in their explained variance. The explained variance between the subsets of more selective neurons is smaller, and this result seems to be in concert with the conclusion in [20], where it is shown that neuron selectivity does not imply importance for object generalization ability.
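The selection of the most selective neurons by kurtosis can be sketched as below. The use of excess kurtosis (which gives the same ranking as raw kurtosis), the helper names, and the synthetic responses are assumptions for illustration; two artificially sparse units are planted so that the ranking is easy to verify.

```python
import numpy as np

def kurtosis(responses):
    """Excess kurtosis of each unit's responses (rows: stimuli, columns: units)."""
    centered = responses - responses.mean(axis=0)
    var = (centered ** 2).mean(axis=0)
    return (centered ** 4).mean(axis=0) / var ** 2 - 3.0

def most_selective_units(responses, fraction):
    """Indices of the top `fraction` of units ranked by kurtosis, highest first."""
    k = max(1, int(round(fraction * responses.shape[1])))
    return np.argsort(kurtosis(responses))[::-1][:k]

rng = np.random.default_rng(6)
dense = rng.normal(size=(5000, 8))                        # broadly responsive units
active = rng.uniform(size=(5000, 2)) < 0.05               # rare activation events
sparse = rng.normal(size=(5000, 2)) * active              # rarely active, selective units
responses = np.hstack([dense, sparse])                    # units 8 and 9 are the sparse ones

top = most_selective_units(responses, 0.2)                # top 20% of 10 units
print(sorted(top.tolist()))
```

The explained variance would then be computed between the selected subsets of units from the two networks of a similar performing pair.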

(v) A good representation does not necessarily need to be IT-like

In the literature [2], it is shown that if an object representation is IT-like, it can give a good object recognition performance. This work shows that the converse is not necessarily true, at least theoretically speaking. That is, as shown in the above experiments and discussions, many different representations can give the same or quite similar recognition results with or without noise.

Remark 6: In this work, we assume the final classifier is a Softmax classifier. For other linear classifiers, the general conclusion of non-equivalence can be similarly derived. Of course, if the classifier used is a nonlinear one, or the output of the penultimate layer is further processed by a nonlinear operator before being input to a linear classifier, as done in [1], where a third-order polynomial is used as a preprocessing step for the final classification, our results will no longer hold. However, since it is shown in [21] that monkey IT neuron responses can be reliably decoded by a linear classifier, we believe that using Softmax as the final classifier for DCNN-based IT cortex modelling does not constitute a major problem for our results.

3 Conclusion

We are not against using DCNNs to model sensory cortex. In fact, their potential and usefulness have been demonstrated in [4, 5]. Here, we only provide a theoretical reminder of the possible non-uniqueness phenomenon of the object representations learnt by DCNNs, in particular by the goal-driven approach proposed in [5]. As shown in the convergent-learning literature, such a non-uniqueness phenomenon is prevalent in deep learning. Hence, when DCNNs are used for modelling sensory cortex as a general framework, people should be aware of this potential and inherent non-uniqueness problem, and appropriate network architectures in DCNN learning should be carefully considered.

Author Contributions

Zhanyi Hu conceived of the non-uniqueness phenomenon of object representation in modelling IT cortex by DCNN. Qiulei Dong and Zhanyi Hu explored the method. Qiulei Dong and Bo Liu implemented the explored method and performed the validation. Qiulei Dong and Zhanyi Hu wrote the paper.

Acknowledgements

This work was supported by the Strategic Priority Research Program of the Chinese Academy of Sciences (XDB32070100), and National Natural Science Foundation of China (U1805264, 61573359).

Appendix

Procedure to train DCNN1 and DCNN2:

Input: A set of $n$ image objects: $D=\{I_{i},i=1,2,\cdots,n\}$ with known categorization labels.

Output: DCNN1 and DCNN2 whose object representations are different but with the same (or similar) categorization performance;

1. Use $D=\{I_{i},i=1,2,\cdots,n\}$ to train a DCNN by optimizing the categorization performance. This training can be done in the same way as reported in the numerous works on image categorization. Denote the trained DCNN as DCNN1. The output of the penultimate layer in DCNN1 for $D$ is denoted as $X=\{x_{i},i=1,2,\cdots,n\}$, where $x_{i}$ is the output for input image $I_{i}$. Denote the output of the final layer in DCNN1 for $D$ as $X^{\prime}=\{x_{i}^{\prime},i=1,2,\cdots,n\}$. The weighting matrix at the final layer in DCNN1 is $W_{1}$ and the bias vector is $b_{1}$, that is, $x_{i}^{\prime}=W_{1}x_{i}+b_{1}$;

2. Choose a nonlinear monotonically increasing function $f(\cdot)$, and compute $Y^{\prime}=\{y_{i}^{\prime},i=1,2,\cdots,n\}$, where $y_{i}^{\prime}=f(x_{i}^{\prime})$ in element-wise mapping;

3. Choose a weighting matrix $W_{2}$ for the second DCNN, say $W_{2}=W_{1}$;

4. Compute $Y=\{y_{i},i=1,2,\cdots,n\}$ by $y_{i}=(W_{2}^{T}W_{2})^{+}W_{2}^{T}(y_{i}^{\prime}-b_{2})$;

5. Use the training pairs $\{(I_{i}\leftrightarrow y_{i}),i=1,2,\cdots,n\}$ to train the second DCNN to minimize the Euclidean loss between the DCNN’s output $\tilde{y}_{i}$ and $y_{i}$;

6. The trained DCNN in step (5) is our required DCNN2. The object representation $x_{i}$ of DCNN1 and $y_{i}$ of DCNN2 are different representations by Definition 2, because for the same object $I_{i}$, $x_{i}$ and $y_{i}$ can give the same categorization results in theory without noise, or similar results with noise in practice, but they cannot be related by a linear transformation as shown in Proposition 2.
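Steps 2 to 4 of the procedure can be sketched numerically as follows. Everything concrete here is an assumption for illustration: random arrays stand in for the penultimate outputs and final-layer parameters of a trained DCNN1, $f(v)=v^{3}+v$ is used as the increasing function, and $b_{2}=b_{1}$ is chosen because the procedure does not specify $b_{2}$. Training DCNN2 on the targets (step 5) is not shown.

```python
import numpy as np

def f(v):
    """Illustrative nonlinear, strictly increasing function (an assumption)."""
    return v ** 3 + v

rng = np.random.default_rng(7)
n, M, N = 500, 64, 10
X = rng.normal(size=(n, M))                 # stand-in for the outputs x_i of step 1
W1 = rng.normal(size=(N, M)) / np.sqrt(M)   # stand-in final-layer weights of DCNN1
b1 = rng.normal(size=N)
X_out = X @ W1.T + b1                       # x_i' = W1 x_i + b1

Y_out = f(X_out)                            # step 2: y_i' = f(x_i') element-wise
W2, b2 = W1, b1                             # step 3: W2 = W1 (b2 = b1 is an assumption)
P = np.linalg.pinv(W2.T @ W2, 1e-10) @ W2.T
Y = (Y_out - b2) @ P.T                      # step 4: y_i = (W2^T W2)^+ W2^T (y_i' - b2)

# Y holds the regression targets for training DCNN2 in step 5.
# The Softmax arg-max equals the arg-max of the final-layer outputs.
labels_1 = np.argmax(X_out, axis=1)
labels_2 = np.argmax(Y @ W2.T + b2, axis=1)
same_labels = bool(np.all(labels_1 == labels_2))

# Relative residual of the best least-squares linear map (with intercept) from X to Y.
X1 = np.hstack([X, np.ones((n, 1))])
coef, *_ = np.linalg.lstsq(X1, Y, rcond=None)
relative_residual = float(
    np.linalg.norm(Y - X1 @ coef) / np.linalg.norm(Y - Y.mean(axis=0))
)
print(same_labels, round(relative_residual, 3))
```

In this toy setting the two sets of labels coincide while the linear fit from `X` to `Y` leaves a clearly non-zero residual.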

Bibliography

[1] Chang, L., Tsao, D. (2017). The Code for Facial Identity in the Primate Brain. Cell, 1013–1028.
[2] Khaligh-Razavi, S., Kriegeskorte, N. (2014). Deep Supervised, but Not Unsupervised, Models May Explain IT Cortical Representation. PLOS Computational Biology, 10(11), e1003915.
[3] Cadieu, C., Hong, H., Yamins, D., Pinto, N., Ardila, D., Solomon, E., et al. (2014). Deep Neural Networks Rival the Representation of Primate IT Cortex for Core Visual Object Recognition. PLoS Comput Biol, 10(12), e1003963. doi:10.1371/journal.pcbi.1003963.
[4] Yamins, D., Hong, H., Cadieu, C., Solomon, E., Seibert, D., DiCarlo, J. (2014). Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proc. Natl. Acad. Sci., 111(23), 8619–8624.
[5] Yamins, D., DiCarlo, J. (2016). Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, 19(3), 356–365.
[6] Hong, H., Yamins, D., Majaj, N., DiCarlo, J. (2016). Explicit information for category-orthogonal object properties increases along the ventral stream. Nature Neuroscience, 19, 613–622.
[7] Li, Y., Yosinski, J., Clune, J., Lipson, H., Hopcroft, J. (2016). Convergent Learning: Do different neural networks learn the same representations? In Proc. ICLR.
[8] Krizhevsky, A., Sutskever, I., Hinton, G. E. (2012). ImageNet Classification with Deep Convolutional Neural Networks. In Proc. Advances in Neural Information Processing Systems 25.
