Locality-Promoting Representation Learning
Johannes Schneider

TL;DR
This paper reveals that CNN filters tend to have larger weights near the center, introduces a regularization method to promote this locality, and demonstrates improved accuracy across various architectures and datasets.
Contribution
The paper introduces Locality-promoting Regularization (LOCO-Reg) to enforce spatial locality in CNN filters, improving performance and providing theoretical insights.
Findings
Weights near filter centers are larger than those on the outside.
LOCO-Reg improves accuracy across multiple CNN architectures.
The empirical locality pattern is explained by maximizing feature cohesion.
Abstract
This work investigates fundamental questions related to learning features in convolutional neural networks (CNN). Empirical findings across multiple architectures such as VGG, ResNet, Inception, DenseNet and MobileNet indicate that weights near the center of a filter are larger than weights on the outside. Current regularization schemes violate this principle. Thus, we introduce Locality-promoting Regularization (LOCO-Reg), which yields accuracy gains across multiple architectures and datasets. We also show theoretically that the empirical finding is a consequence of maximizing feature cohesion under the assumption of spatial locality.

Table I: Outcome of the binomial test for various architectures.

| Dataset | Architecture | All Layers | | Lower Layers | | Upper Layers | |
|---|---|---|---|---|---|---|---|
| ImageNet [pre-trained, from Keras] | VGG16 | 0.549∗∗∗ | 0.502∗∗∗ | 0.577∗∗∗ | 0.543∗∗∗ | 0.546∗∗∗ | 0.499∗∗ |
| ImageNet [pre-trained, from Keras] | ResNet50 | 0.548∗∗∗ | 0.531∗∗∗ | 0.529∗∗∗ | 0.564∗∗∗ | 0.551∗∗∗ | 0.527∗∗∗ |
| ImageNet [pre-trained, from Keras] | InceptionV3 | 0.49∗∗∗ | 0.447∗∗∗ | 0.573∗∗∗ | 0.532∗∗∗ | 0.483∗∗∗ | 0.439∗∗∗ |
| ImageNet [pre-trained, from Keras] | Xception | 0.626∗∗∗ | 0.486∗∗∗ | 0.689∗∗∗ | 0.572∗∗∗ | 0.576∗∗∗ | 0.419∗∗∗ |
| ImageNet [pre-trained, from Keras] | MobileNet | 0.63∗∗∗ | 0.555∗∗∗ | 0.805∗∗∗ | 0.775∗∗∗ | 0.588∗∗∗ | 0.504 |

Table II: Architectures used in the experiments.

| MobileNet Adaption [8] | | VGG10 Adaption [9] | |
|---|---|---|---|
| Type/Stride | Filter Shape | Type/Stride | Filter Shape |
| C/s1 | | C/s1 | |
| C dw/s1 | | MP/s2 | |
| C/s1 | | C/s1 | |
| C dw/s2 | | C/s1 | |
| C/s1 | | MP/s2 | |
| C dw/s1 | | C/s1 | |
| C/s1 | | C/s1 | |
| C dw/s2 | | MP/s2 | |
| C/s1 | | C/s1 | |
| C dw/s1 | | C/s1 | |
| C/s1 | | MP/s2 | |
| C dw/s2 | | C/s1 | |
| C/s1 | | C/s1 | |
| C dw/s1 | | MP/s2 | |
| C/s1 | | | |
| C dw/s2 | | | |
| C/s1 | | | |
| C dw/s1 | | | |
| C/s1 | | | |
| C dw/s2 | | | |
| FC/s1 | nClasses | FC/s1 | nClasses |
| Soft/s1 | Classifier | Soft/s1 | Classifier |
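
Table III: Accuracy for (anti-)locality-promoting regularization parameters (E1).

| Parameters | Acc. | Parameters | Acc. | Parameters | Acc. | Parameters | Acc. |
|---|---|---|---|---|---|---|---|
| 1.6,2.4 | .8812∗∗ | 1.6,2.2 | .8793∗ | 1.6,2.0 | .8815∗ | 1.6,1.8 | .8819 |
| 2.4,1.6 | .874 | 2.2,1.6 | .8781 | 2.0,1.6 | .8789 | 1.8,1.6 | .8789 |
| 1,1 | .8790∗∗∗ | 1,1 | .8790∗ | 1,1 | .8790∗∗ | 1,1 | .8790∗ |
| .6,.4 | .8677 | .6,.6 | .8767 | .8,.6 | .8733 | .8,.8 | .8772 |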

Table IV: LOCO-Reg on multiple architectures and datasets (E2). Columns .00025 to .002 give the average accuracy for that overall regularization strength.

| Dataset | Architecture | Parameters | .00025 | .0005 | .001 | .002 | Best Acc. |
|---|---|---|---|---|---|---|---|
| cifar10 | MobileNet | (1,1) | .8611 | .8686 | .8688 | .8647 | .8688 |
| cifar10 | MobileNet | (1.4,1.56) | .8618 | .8701∗ | .8714 | .8657 | .8714 |
| cifar10 | MobileNet | (1.8,2.13) | .8619 | .8692 | .8721∗ | .8668∗ | .8721∗ |
| cifar10 | ResNet | (1,1) | .9191 | .9227 | .9236 | .9222 | .9236 |
| cifar10 | ResNet | (1.4,1.56) | .921 | .9253∗ | .9242 | .9224 | .9253∗ |
| cifar10 | ResNet | (1.8,2.13) | .9186 | .9244∗ | .9237 | .9236 | .9244∗ |
| cifar10 | VGG | (1,1) | .8754 | .8761 | .882 | .8858 | .8858 |
| cifar10 | VGG | (1.4,1.56) | .8722 | .884∗∗ | .8858∗∗ | .8869 | .8869 |
| cifar10 | VGG | (1.8,2.13) | .8808∗∗ | .8816∗ | .8875∗∗∗ | .8884∗ | .8884∗ |
| cifar100 | MobileNet | (1,1) | .5926 | .6116 | .6182 | .6155 | .6182 |
| cifar100 | MobileNet | (1.4,1.56) | .5941 | .6124 | .6182 | .6149 | .6182 |
| cifar100 | MobileNet | (1.8,2.13) | .5935 | .6144 | .6199 | .6184∗ | .6199 |
| cifar100 | ResNet | (1,1) | .702 | .71 | .7156 | .7124 | .7156 |
| cifar100 | ResNet | (1.4,1.56) | .702 | .7129∗ | .7163 | .7146 | .7163 |
| cifar100 | ResNet | (1.8,2.13) | .7022 | .7116 | .7198∗∗ | .7142 | .7198∗∗ |
| cifar100 | VGG | (1,1) | .6415 | .6551 | .6597 | .6599 | .6599 |
| cifar100 | VGG | (1.4,1.56) | .6432 | .6583∗ | .6665∗∗∗ | .6645∗ | .6665∗∗∗ |
| cifar100 | VGG | (1.8,2.13) | .6449∗ | .6629∗∗∗ | .6653∗∗ | .6671∗∗∗ | .6671∗∗∗ |
| fashion | MobileNet | (1,1) | .9403 | .9402 | .939 | .9369 | .9403 |
| fashion | MobileNet | (1.4,1.56) | .9398 | .9406 | .9385 | .9372 | .9406 |
| fashion | MobileNet | (1.8,2.13) | .9402 | .9408 | .9398 | .9371 | .9408 |
| fashion | ResNet | (1,1) | .9501 | .9504 | .9494 | .9492 | .9504 |
| fashion | ResNet | (1.4,1.56) | .9496 | .951 | .9506∗ | .9489 | .951 |
| fashion | ResNet | (1.8,2.13) | .9509∗ | .9505 | .9515∗ | .9494 | .9515∗ |
| fashion | VGG | (1,1) | .9404 | .942 | .9417 | .9426 | .9426 |
| fashion | VGG | (1.4,1.56) | .941 | .9414 | .9419 | .9436∗ | .9436∗ |
| fashion | VGG | (1.8,2.13) | .9423 | .9417 | .9436∗ | .9437∗ | .9437∗ |

Table V: 5x5 and 7x7 spatial filters (E3).

| Size | Parameters (Eq. 5) | Acc. | Size | Acc. |
|---|---|---|---|---|
| 5x5 | 1,0 | .866 | 7x7 | .856 |
| 5x5 | 1,0.25 | .873∗∗ | 7x7 | .863∗ |
| 5x5 | 1,0.35 | .874∗∗∗ | 7x7 | .864∗∗∗ |

Table VI: LOCO-Reg on larger images (E4).

| Dataset | Parameters | Acc. | Dataset | Parameters | Acc. |
|---|---|---|---|---|---|
| DeepW. | (1,1) | .73 | TinyIm. | (1,1) | .4284 |
| DeepW. | (1.4,1.56) | .7393∗ | TinyIm. | (1.4,1.56) | .4313∗ |
| DeepW. | (1.8,2.13) | .7484∗ | TinyIm. | (1.8,2.13) | .4323∗ |

Table VII: STRIP-Reg and ACNet (E5).

| Network and Regularization | Initialization | Accuracy |
|---|---|---|
| ACNet [5] (baseline) | Default | 86.18 |
| ACNet with 3x3 filters, Default-Reg, | Default | 86.20 |
| ACNet with 3x3 filters, STRIP-Reg, | Default | 86.24 |
| ACNet with 3x3 filters, STRIP-Reg, | Default | 86.36 |
| ACNet with 3x3 filters, STRIP-Reg, | Scaled | 86.40 |

Locality-Promoting Representation Learning
Johannes Schneider
Institute of Information Systems,
University of Liechtenstein, Vaduz,Liechtenstein
Abstract
This work investigates questions related to learning features in convolutional neural networks (CNN). Empirical findings across multiple architectures such as VGG, ResNet, Inception and MobileNet indicate that weights near the center of a filter are larger than weights on the outside. Current regularization schemes violate this principle. Thus, we introduce Locality-promoting Regularization, which yields accuracy gains across multiple architectures and datasets. We also show theoretically that the empirical finding could be explained by maximizing feature cohesion under the assumption of spatial locality.
I Introduction
While the design of deep learning architectures is still a very active field of research, feature engineering seems to be passé thanks to deep learning’s capability for end-to-end learning. Feature hierarchies are learnt almost miraculously during the optimization of the loss function. Questions such as “What areas (or patterns) of objects constitute good features?”, eg. in terms of generalization capability or robustness, have received fairly little attention. In this work, we shed some light on properties and learning of individual features. More precisely, we investigate how weights of features in spatial dimensions of CNNs should be distributed and how to regularize them accordingly. To this end, we focus on spatial data relying on empirical observations and well-known principles, ie. the Principle of Locality. Locality is a known theme in physics and in computer science, eg. [1]. In machine learning, CNNs or other fundamental primitives such as word-vectors [2] use some form of “locality”, ie. the idea that interaction strength decreases with distance. It justifies ignoring dependencies among data items, if their distances are above a threshold. Thus, “windows” or “patches” of inputs can be used rather than the entire input. Prior work by Bengio et al. [3] has motivated ideas for representation learning based on principles from physics, but not using the Locality Principle. Our work addresses a call by Lake et al. [4] for better grounding of deep learning by using physical principles. Under the assumption of locality and aiming for cohesive features, it can be observed that weights of features in spatial dimensions close to a center are more relevant, ie. larger. The theoretical finding is supported by empirical evidence through investigating learnt features of multiple architectures. Existing L2-regularization schemes regularize all locations equally, thereby counteracting locality by reducing more central weights too much. 
Our first locality-promoting L2-regularization scheme (“LOCO-Reg”) yields improvements across multiple architectures such as ResNet, VGG and MobileNet. Furthermore, as a by-product, LOCO-Reg provides a new type of architectural element that is a compromise between (spatial) 3x3 convolutions and 1x1 convolutions. For example, depending on the regularization parameters, a 3x3 convolution using LOCO-Reg might resemble a 1x1 convolution or be closer to a 3x3 convolution with standard L2-regularization. Our second locality-promoting L2-regularization scheme (“STRIP-Reg”) applies this idea to recent work [5], replacing 1x3 and 3x1 filters with 3x3 filters.
II Spatial Weights Distribution
We investigate weights of features along spatial dimensions, that is, width and height of two-dimensional images. We call a filter a spatial filter if its width and height are larger than one but it has just one channel, ie. it is of depth one. Multiple spatial filters constitute a filter, where each spatial filter corresponds to a “channel”. Weights near a spatial filter’s center are on average larger than those near its boundary, as illustrated in Figure 1. We also investigated more formally whether there are significant differences in how often a weight at one position is larger than at another. For a 3x3 spatial filter, let an indicator be 1 if the weight at the center is larger than the weight to the left of it, and 0 otherwise. More generally, the indicator for a pair of locations is 1 if, for a given spatial filter, the weight at the first location is larger than the weight at the second. For a set of spatial filters we consider the average of these indicators. We conduct a binomial test to investigate whether the average is significantly larger than the expectation, ie. 0.5. Table I shows the outcome for various architectures. For lower layers there is consistent support for the hypothesis that weights near the center are larger, but for higher layers the support is mostly weaker or absent. The overall result is dominated by higher layers, since they contain more features, ie. more spatial filters. One explanation for the discrepancy between lower and higher layers is that down-sampling yields more abstract, semantically richer features with less precise localization. Localization of features in upper layers might be less relevant, since the goal of classification is not localization. Multiple variations of the test, eg. using different locations or comparing weight magnitudes, all yield that more central weights are larger. Overall, we conclude that evidence for locality is likely to be found in any network, but not necessarily in upper layers.
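The test described above can be sketched in a few lines. This is a minimal illustration, not the evaluation code of the paper: the function names are hypothetical, the filters are synthetic stand-ins for learnt weights (a positive offset is added to the center by construction), and the row-major indexing with the center at `[1][1]` is an assumption.

```python
import math
import random


def center_vs_left_indicator(spatial_filter):
    """Return 1 if the center weight exceeds the weight to its left, else 0."""
    center = spatial_filter[1][1]
    left = spatial_filter[1][0]
    return 1 if center > left else 0


def binomial_p_value_greater(successes, trials, p=0.5):
    """One-sided exact binomial test: P(X >= successes) under success rate p."""
    return sum(
        math.comb(trials, k) * p ** k * (1 - p) ** (trials - k)
        for k in range(successes, trials + 1)
    )


def locality_test(filters):
    """Average indicator over a set of 3x3 filters and its one-sided p-value."""
    hits = sum(center_vs_left_indicator(f) for f in filters)
    return hits / len(filters), binomial_p_value_greater(hits, len(filters))


# Synthetic stand-in for learnt filters (assumption): Gaussian weights with a
# positive offset at the center, so the indicator average should exceed 0.5.
rng = random.Random(0)
synthetic_filters = [
    [[rng.gauss(0.0, 1.0) + (0.5 if (i, j) == (1, 1) else 0.0) for j in range(3)]
     for i in range(3)]
    for _ in range(500)
]
average, p_value = locality_test(synthetic_filters)
print(f"average indicator: {average:.3f}, p-value: {p_value:.2e}")
```

For real networks, the list of filters would be replaced by the spatial slices of each convolutional kernel, grouped by layer to separate lower from upper layers.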
III Locality-Promoting Regularization
Current L2-regularization practice might not be optimal. It regularizes weights equally within a spatial filter and thus works against the natural tendency for weights in the center to be larger than outer ones (Section II). We therefore aim to encourage learning of weights so that those near the center are larger than those in the outer areas of a spatial filter (a theoretical motivation follows in Section IV). One mechanism to guide the learning process is to regularize weights near the center less. To do so, the overall regularization parameter $\lambda$ is adjusted by a factor $\eta_{i,j}$ depending on location $(i,j)$. For a spatial filter $F$ with weights $w_{i,j}$, the regularization loss becomes $\lambda \sum_{i,j} \eta_{i,j} w_{i,j}^2$ for L2-regularization. Note that for conventional L2-regularization we have $\eta_{i,j} = 1$. For simplicity of notation, we focus on 3x3 spatial filters. Larger filters are considered in the evaluation. We propose two implementations of this idea:
LOCO-Reg
The suggested LOCO-Reg (LOCality-prOmoting Regularization) loss for each spatial filter, with a set of corner indexes and a set of the four nearest neighbors of the center, is given by:
[Equation 1]
We use a symmetric function for the regularization factors, ie. all corners are regularized identically with one parameter and all direct neighbors of the center are regularized identically with a second parameter. It is convenient to constrain the factors so that the meaning of the overall regularization parameter remains that of the magnitude of the overall regularization. If both parameters equal 1, we get equal regularization of all weights, ie. standard regularization. LOCO-Reg requires parameters that regularize outer weights more strongly than the center, which encourages larger weights near the center. To better understand the behavior of LOCO-Reg, or more generally regularization with parameters that differ spatially, consider the regularization behavior towards the end of the optimization process, ie. once features have been learnt and are not supposed to change significantly. Assume the network learnt a filter $F^*$ with weights $w^*_{i,j}$ that is optimal in terms of generalization performance. With regularization factors $\eta_{i,j}$ and overall strength $\lambda$, the L2-regularization loss is $\lambda \sum_{i,j} \eta_{i,j} (w^*_{i,j})^2$. One condition that optimal regularization parameters should fulfill is that they should not regularize one of the (optimal) weights of $F^*$ more than another, since this would yield a non-optimal filter. That is, we want for the L2-loss terms that $\eta_{i,j} (w^*_{i,j})^2 = \eta_{k,l} (w^*_{k,l})^2$ for all locations $(i,j)$ and $(k,l)$. This implies that $\eta_{i,j} = c / (w^*_{i,j})^2$ for a constant $c$, which can be determined using the aforementioned constraint. The formula implies that regularization should be less where weights are larger. Thus, given the empirical observation that the learnt (locally) optimal weights of the majority of spatial filters are larger near the center (Section II), LOCO-Reg might indeed be preferable to uniform regularization.
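The following sketch illustrates location-dependent L2-regularization for a 3x3 spatial filter and the balancing argument above. It is an illustration under stated assumptions, not the definition from the paper: the function names are hypothetical, the normalization of the factors to a mean of 1 is an assumed stand-in for the paper's constraint, and reading a parameter pair such as (1.4, 1.56) as (direct neighbors, corners) relative to a center value of 1 is also an assumption.

```python
def weighted_l2_loss(weights, factors, strength):
    """Location-dependent L2 loss: strength * sum of factor * weight**2."""
    return strength * sum(
        f * w * w
        for factor_row, weight_row in zip(factors, weights)
        for f, w in zip(factor_row, weight_row)
    )


def normalize(factors):
    """Rescale factors to mean 1 (assumed normalization, see the lead-in)."""
    count = sum(len(row) for row in factors)
    mean = sum(sum(row) for row in factors) / count
    return [[f / mean for f in row] for row in factors]


def symmetric_factors(center, neighbor, corner):
    """3x3 factor map with one value each for center, neighbors and corners."""
    return normalize([
        [corner, neighbor, corner],
        [neighbor, center, neighbor],
        [corner, neighbor, corner],
    ])


def balanced_factors(weights):
    """Factors proportional to 1 / weight**2, so that all loss terms are equal."""
    return normalize([[1.0 / (w * w) for w in row] for row in weights])


weights = [[0.1, 0.2, 0.1], [0.2, 0.6, 0.2], [0.1, 0.2, 0.1]]
uniform = symmetric_factors(1.0, 1.0, 1.0)   # conventional L2-regularization
local = symmetric_factors(1.0, 1.4, 1.56)    # assumed order: neighbors, corners
balanced = balanced_factors(weights)

print(weighted_l2_loss(weights, uniform, 0.0005))
print(weighted_l2_loss(weights, local, 0.0005))
print(balanced[1][1], balanced[0][1], balanced[0][0])
```

For a filter whose center weight is largest, the balanced factors are smallest at the center, which matches the statement that regularization should be less where weights are larger.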
Next, we provide some intuition on what kinds of spatial filters are fostered by LOCO-Reg. Consider the task of locating two features given just a single feature map, shown in Figure 2. (More generally, one might also think of Figure 2 as a weighted aggregation of feature maps.) Since most values are small, ie. 0 or 1, and potentially superimposed by noise, features seem to be present where the values of the feature map are large. Figure 2 shows the outcomes of two strategies for feature definition. The strategy that treats all parts of a proposed feature uniformly places features so that the aggregated sum of inputs covered by the feature is maximal (blue features). The other strategy, aligned with LOCO-Reg, also seeks areas where the sum is large, but it places more importance on locations near the center (red features). Thus, we might learn features with larger center weights that activate more at maxima of a feature map compared to uniformly regularized features.
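The two placement strategies can be illustrated with a small sketch. It simplifies the two-dimensional setting of Figure 2 to one dimension, and the feature map, the kernels and the function name are made up for illustration only.

```python
def best_position(feature_map, kernel):
    """Return the center index where the kernel-weighted sum is largest."""
    half = len(kernel) // 2
    scores = {}
    for center in range(half, len(feature_map) - half):
        window = feature_map[center - half:center + half + 1]
        scores[center] = sum(k * v for k, v in zip(kernel, window))
    return max(scores, key=scores.get)


# Hypothetical 1D feature map: an isolated maximum at index 3 and a
# plateau of medium values around index 7.
feature_map = [0, 1, 1, 3, 1, 0, 2, 2, 2, 0]
uniform_kernel = [1.0, 1.0, 1.0]   # all parts of the feature count equally
center_kernel = [0.5, 2.0, 0.5]    # the center counts more than the outer parts

print(best_position(feature_map, uniform_kernel))
print(best_position(feature_map, center_kernel))
```

With these made-up values, the uniform kernel settles on the plateau, where the aggregated sum is largest, while the center-weighted kernel settles on the maximum of the map.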
STRIP-Reg
We also apply locality-promoting regularization to improve upon a recent architecture that uses the same empirical observation, ie. central weights are larger, to foster locality [5]. (The first arxiv.org version of this paper appeared before theirs.) Regularization contributes in an indirect manner to locality in [5], as discussed later. We show how locality-promoting L2-regularization might be accommodated in their scheme. For each 3x3 filter, two additional filters are trained in [5]: one 1x3 filter applied to the same data as the center row of the 3x3 filter and one 3x1 filter applied to the center column. The outputs of all three filters are added. In architecture “wBN” all three filters use separate batchnorm layers before addition; in architecture “w/oBN” they use the same batchnorm layer applied after addition. After training, the weights of all three filters are merged into a single 3x3 filter using the additive property of convolutions. Thus, during inference only a single filter has to be evaluated, as for other common architectures. We modify their architecture as follows: we replace the 3x1 and 1x3 filters with 3x3 filters but regularize more strongly those weights that are not deemed relevant, ie. that are not present in [5]. We call this form of locality-promoting regularization STRIP-Reg, since it regularizes entire horizontal and vertical stripes based on their locality. For the 3x3 filter replacing the 3x1 filter, we regularize a weight with the overall regularization strength if it lies in the center column, ie. at a location covered by the 3x1 filter, and otherwise with a strength that is increased by an additional parameter. The additional parameter thus controls the extra regularization of weights not present in [5]. 1x3 filters are treated analogously. We also briefly assess the idea of initializing more strongly regularized locations with lower (expected) values, ie. we use a scaled initialization: for the 3x3 filter maps replacing the 1x3 and 3x1 maps, we double the weights corresponding to the 1x3 and 3x1 maps and halve all other weights after the default initialization, eg. Xavier.
This leaves several statistical properties of common initialization schemes intact, eg. the expectation of the sum of weights of a 3x3 filter remains unchanged for typical initialization schemes like Xavier with independent weight initialization, expectation $0$ and standard deviation $\sigma$. More precisely, $\mathbb{E}[\sum_{i,j} w_{i,j}] = \sum_{i,j} \mathbb{E}[w_{i,j}] = 0$, since $\mathbb{E}[w_{i,j}] = 0$ also holds after scaling. This also holds for expected magnitudes. We have $\mathbb{E}[\sum_{i,j} |w_{i,j}|] = 9m$ if all weights are initialized in the same manner, that is $\mathbb{E}[|w_{i,j}|] = m$. For scaled initialization, 6 weights are initialized with half the expected magnitude, ie. $m/2$, and 3 with $2m$. This yields $6 \cdot m/2 + 3 \cdot 2m = 9m$.
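The bookkeeping behind the scaled initialization can be checked with a short sketch. The helper names are hypothetical, and the base magnitude is an arbitrary placeholder rather than a value from the paper.

```python
CENTER_COLUMN = {(0, 1), (1, 1), (2, 1)}   # locations covered by a 3x1 filter
CENTER_ROW = {(1, 0), (1, 1), (1, 2)}      # locations covered by a 1x3 filter


def scale_map(kept_positions):
    """Scale 2.0 at locations of the original 1x3 or 3x1 filter, 0.5 elsewhere."""
    return [[2.0 if (i, j) in kept_positions else 0.5 for j in range(3)]
            for i in range(3)]


def apply_scaling(weights, scales):
    """Rescale freshly initialized weights location by location."""
    return [[w * s for w, s in zip(weight_row, scale_row)]
            for weight_row, scale_row in zip(weights, scales)]


def expected_magnitude_sum(scales, base_magnitude):
    """Sum of expected magnitudes after scaling, given E|w| = base_magnitude."""
    return sum(s * base_magnitude for row in scales for s in row)


base_magnitude = 0.1   # placeholder for the expected magnitude of one weight
for kept in (CENTER_COLUMN, CENTER_ROW):
    scales = scale_map(kept)
    # 3 doubled and 6 halved weights: 3 * 2m + 6 * m / 2 = 9m
    print(expected_magnitude_sum(scales, base_magnitude), 9 * base_magnitude)
```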
While [5] provides extensive experimental evaluation supported by a well-structured repo, it lacks any motivation (i) why weights near the center should be larger in general (see Section IV), (ii) why their method actually enforces weights near the center to be larger. We discuss point (ii) next.
We analyze the network scenario “w/oBN”, which uses a single batchnorm layer for all three added filters and is more tractable for analysis. Let $w_c$ be the center weight of a spatial filter and $w_q$ a corner weight. [5] uses three weights per center of a spatial filter, ie. $w_c^{3\times3}$ for the regular 3x3 convolution, $w_c^{1\times3}$ for the horizontal 1x3 convolution and $w_c^{3\times1}$ for the vertical 3x1 convolution, while they use just one weight $w_q^{3\times3}$ for a corner, originating from the 3x3 convolution. Thus, during inference one might set $w_c = w_c^{3\times3} + w_c^{1\times3} + w_c^{3\times1}$ and $w_q = w_q^{3\times3}$ to obtain the same outcomes as [5]. (See [5] for a justification based on linearity of convolutions.) While the idea to use multiple kernels has several consequences, one is that using three weights instead of one reduces the impact of L2-regularization for central weights. Assume that all three weights are non-zero and have the same sign. Then the regularization loss for a center weight is less if individual weights are regularized before summation, ie. $(w_c^{3\times3})^2 + (w_c^{1\times3})^2 + (w_c^{3\times1})^2 < (w_c^{3\times3} + w_c^{1\times3} + w_c^{3\times1})^2$, while the regularization loss is the same for corners, ie. $(w_q^{3\times3})^2 = w_q^2$. For most initializations the assumption that weights have the same sign does not hold. However, let $w_c^*$ be the optimal solution for weight $w_c$. Then the optimal solution for the added weights is $w_c^{3\times3} = w_c^{1\times3} = w_c^{3\times1} = w_c^*/3$, since this minimizes the L2-loss but does not impact any loss depending on predictions, eg. softmax loss.
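A small sketch of the merging step and of the effect on the L2-loss follows. It is an illustration of the argument above with made-up weight values and hypothetical function names, not code from [5].

```python
def merge_filters(full, row, column):
    """Merge a 3x3 kernel, a 1x3 kernel (center row) and a 3x1 kernel
    (center column) into one 3x3 kernel by adding weights per location."""
    merged = [list(r) for r in full]
    for j in range(3):
        merged[1][j] += row[j]
    for i in range(3):
        merged[i][1] += column[i]
    return merged


def l2(values):
    """Plain L2 penalty (sum of squares) of a list of weights."""
    return sum(v * v for v in values)


# Made-up kernels in which the three center weights share one sign.
full = [[0.1, 0.1, 0.1], [0.1, 0.2, 0.1], [0.1, 0.1, 0.1]]
row = [0.1, 0.2, 0.1]
column = [0.1, 0.2, 0.1]
merged = merge_filters(full, row, column)

separate_center_loss = l2([full[1][1], row[1], column[1]])   # regularized before summation
merged_center_loss = l2([merged[1][1]])                      # regularized after summation
print(separate_center_loss, merged_center_loss)

# Splitting a target center weight evenly minimizes the separate L2 penalty.
target = merged[1][1]
print(l2([target / 3] * 3), l2([target / 2, target / 2, 0.0]))
```

The corner weights of the merged kernel equal those of the 3x3 kernel, so their penalty is the same before and after merging.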
IV Theory: Feature Cohesion And Locality
We show that locality combined with the objective of obtaining cohesive features implies that the more central a location of a spatial filter is, the more relevant it is, ie. the larger the weight should be at that location. As in clustering, ideally features are dense, ie. positioned at areas with high density, and well-separated, as captured by cluster assessment metrics like the Davies-Bouldin Index. Cohesion relates naturally to density and it implies that a feature is stable. We seek to define a feature of fixed dimension, ie. 3x3, so that cohesion is maximal. Our cohesion metric relies on a form of attraction analogous to “gravity” that has also been used in the context of clustering [6, 7]. We formulate the objective of obtaining highly cohesive features by demanding that feature parts should maximize their attraction. We use the common idealization of point masses, meaning that we neglect the spatial extension of parts of a feature and subsume their strength at a single point (Figure 3). The attraction or force between masses at locations $(i,j)$ and $(k,l)$ might be measured analogous to gravity as $G \cdot m_{i,j} \cdot m_{k,l} / d^p$, where $G$ is a constant (as for gravity), $m_{i,j}$ can be seen as the mass or absolute strength of a spatial feature at location $(i,j)$, $d$ is the Euclidean distance between the two locations and $p$ is a parameter, eg. we shall use $p = 2$ as for gravity. This scenario is illustrated in Figure 3.
Cohesion might be measured using the sum of all interactions, ie. the sum of forces among each pair of parts, with the force of a part on itself being defined as 0. We compute distances by taking the differences between indexes, ie. for locations $(i,j)$ and $(k,l)$ the distance is $d = \sqrt{(i-k)^2 + (j-l)^2}$. Any such distance is only proportional to the actual (physical) distance with some proportionality constant, which can be subsumed in the constant of the force.
Theorem 1**.**
For any feature strength distribution with , the cohesion of the feature is increased most by increasing , and more by increasing any than any for arbitrary , center , direct neighbors and corners (Figure 3).
The theorem shows that for many distributions of masses cohesion depends more on masses near the center, ie. increasing any of them yields larger gains in cohesion than increasing masses far away from the center. Locality is incorporated in the definition, which states that interactions decrease with distance. The bounds provided in the theorem might not be tight, but irrespective of this, there are (pathological) feature strength distributions for which enlarging the center does not increase cohesion more than enlarging other parts. For example, assume a corner mass is much bigger than any other mass. Then growing the mass of one of the corner's direct neighbors can lead to a more cohesive solution than growing the center. This follows since the center has a larger distance to the corner.
The theorem follows essentially from the fact that the center has (on average) smaller distances to other parts making its contribution to overall cohesion the largest. Analogously, any direct neighbor of the center has (on average) smaller distances to other parts than any corner.
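The statements above can be checked numerically with a short sketch. It assumes a 3x3 grid indexed from (0,0) to (2,2) with the center at (1,1), a force constant of 1 and an exponent of 2; the function names and the example masses are hypothetical.

```python
from itertools import combinations

POSITIONS = [(i, j) for i in range(3) for j in range(3)]


def squared_distance(a, b):
    """Squared Euclidean distance between two grid positions."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def attraction(masses, a, b, exponent=2.0, constant=1.0):
    """Gravity-like force between the parts at positions a and b."""
    distance = squared_distance(a, b) ** 0.5
    return constant * masses[a] * masses[b] / distance ** exponent


def cohesion(masses):
    """Sum of attractions over all pairs of distinct parts."""
    return sum(attraction(masses, a, b) for a, b in combinations(POSITIONS, 2))


def marginal_gain(masses, position, step=1.0):
    """Increase in cohesion when the mass at `position` grows by `step`."""
    grown = dict(masses)
    grown[position] += step
    return cohesion(grown) - cohesion(masses)


uniform = {p: 1.0 for p in POSITIONS}
for name, position in (("center", (1, 1)), ("neighbor", (0, 1)), ("corner", (0, 0))):
    print(name, round(marginal_gain(uniform, position), 4))

# Pathological case: one corner is much heavier than all other parts.
heavy_corner = dict(uniform)
heavy_corner[(0, 0)] = 10.0
print(round(marginal_gain(heavy_corner, (1, 1)), 4),
      round(marginal_gain(heavy_corner, (0, 1)), 4))
```

For uniform masses the center yields the largest gain, followed by a direct neighbor and then a corner. With the heavy corner, the neighbor adjacent to that corner yields a larger gain than the center.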
Proof.
We consider the dependency of the total force on the center mass, a corner mass and a direct neighbor mass using case enumeration. That is, we investigate the impact on the total force if one of them is changed while the others stay fixed. Any interaction between masses that does not involve any of these three masses can be neglected, since it is not impacted by altering the three masses. Furthermore, due to symmetry only a few cases need to be considered.
First, we prove that increasing the center mass yields a larger change than changing a corner mass or a direct neighbor mass. We consider the contribution of the center mass to the total force, ie. all interactions that involve the center (see Equations 2), as well as the contribution of the corner mass and the contribution of the direct neighbor mass. The cases for changing any other mass are symmetric. The forces can be computed by using the distances shown in Figure 3. Due to symmetry there are only 4 different distances.
[Equations 2]
Next, we substitute and we compute the impact on the total force when changing one of the three chosen masses, ie. , and .
[Equations 3]
We begin with comparing the change of the center to that of the direct neighbor , ie. we start by showing that
[Inequality 4]
We prove inequality (4) by showing that even under a “worst-case” distribution of masses the inequality holds. By assumption for any mass holds . Let us minimize the left hand side () and maximize the right hand side (). Formally, we can consider coefficients for masses in Equations 3. If a coefficient for a mass in is larger than in then the corresponding mass should be maximized, otherwise minimized. More intuitively, enlarging the mass at location increases more than if one of the two conditions holds: (i) if location is closer to the neighbor than to the center and (ii) if depends on location but does not. Condition (i) only applies to the two corners and that are nearest to . That is, these two corners should have maximal masses . For condition (ii) note that the increase of depends on the center (but not on ), whereas the change of does not depend on but on . Thus, the center mass should be maximal, ie. , and all others, including , are minimized, ie. set to . Substituting the suggested masses into Equations 3 gives:
[TABLE]
Setting the two terms, ie. and , equal gives , ie. the bound is .
An analogous consideration for the corner mass , ie. investigating if , yields that in contrast to the direct neighbor , for the corner there are no masses closer to the corner. Thus, this case is subsumed by the prior case for stated in Inequality 4, ie. the prior bound also applies.
Next, we show that changing any of the direct neighbors has greater impact on cohesion, ie. , than changing any of the corners . Due to symmetry it suffices to consider only one mass in , ie. we chose , and two corners, ie. we use and . We begin by showing that increasing the mass of neighbor has more impact than changing the corner . That is, we show . The derivatives are given in Equations 3.
Let us minimize the left hand side () and maximize the right hand side (). is minimized and maximized if masses closer to are maximized or those which only occur with positive coefficient in . This means , and are chosen to be maximal, ie. they are set to and other masses are minimized, ie. set to .
[TABLE]
Setting the two terms, ie. and , equal gives: , ie. .
Finally, we consider the more distant corner . We show . The derivative is given in Equations 3. For we have
[TABLE]
Let us minimize the left hand side () and maximize the right hand side (). As before, is minimized and maximized if masses closer to are maximized or those which only occur with positive coefficient in , ie. and are maximal, ie. they are set to and other masses are minimized, ie. set to . This gives:
[TABLE]
Setting the two terms equal gives . Thus, the bound that is valid for all cases is given by .
∎
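The worst-case argument of the proof can be mirrored numerically. The sketch below assumes masses between a minimum of 1 and a maximum of k, an exponent of 2 and a grid indexed from (0,0) to (2,2). Since the gain of growing a mass is linear in all other masses, the worst case sets each mass to the minimum or the maximum depending on the sign of its coefficient. The function names are hypothetical, and the value this sketch computes is not taken from the paper and need not coincide with the bound stated there.

```python
POSITIONS = [(i, j) for i in range(3) for j in range(3)]


def coupling(a, b):
    """Coefficient 1 / d**2 between two positions (0 for a part with itself)."""
    if a == b:
        return 0.0
    return 1.0 / ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2)


def worst_case_ratio(stronger, weaker):
    """Largest max/min mass ratio k for which growing `stronger` still adds
    more cohesion than growing `weaker`, even for the worst-case masses."""
    positive = 0.0
    negative = 0.0
    for z in POSITIONS:
        diff = coupling(stronger, z) - coupling(weaker, z)
        if diff > 0:
            positive += diff    # worst case: this mass is minimal (1)
        else:
            negative -= diff    # worst case: this mass is maximal (k)
    # The gain difference is positive - k * negative, so it stays > 0 for k below:
    return positive / negative


# By symmetry, four comparisons cover all cases of a 3x3 grid.
comparisons = {
    "center vs neighbor": ((1, 1), (0, 1)),
    "center vs corner": ((1, 1), (0, 0)),
    "neighbor vs adjacent corner": ((0, 1), (0, 0)),
    "neighbor vs far corner": ((0, 1), (2, 0)),
}
ratios = {name: worst_case_ratio(s, w) for name, (s, w) in comparisons.items()}
bound = min(ratios.values())
for name, ratio in ratios.items():
    print(name, round(ratio, 4))
print("ratio valid for all cases:", round(bound, 4))
```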
V Experiments
We provide additional empirical support for locality and assess the performance of LOCO-Reg and STRIP-Reg. We investigate different hyper-parameter settings resulting in a total of five experiments (E1-E5) with more than 1000 trained networks.
V-A Setup and Analysis
We used SGD with momentum 0.9 and batch size 128. We trained for 130 epochs, decaying the initial learning rate of 0.1 by 0.3 after epochs 50, 85, 110, 120 and 127. Weights of dense layers were regularized with 0.0005 across all benchmarks. The overall L2-regularization strength was kept at a fixed default value, if not stated differently. We used multiple datasets and architectures, where CIFAR-10 [10] and a VGG-10 variant (Table II) are the default if not stated differently. We used each dataset's default split into training and test data. Data augmentation consisted of horizontal flipping if not stated differently. We trained 15 networks for each configuration, ie. hyperparameter setting. We used the Wilcoxon rank-sum test to assess whether results differed significantly between two configurations. This test is appropriate for small sample sizes that might contain outliers. We report the median accuracy for each configuration. In all tables bold values indicate best performance comparing values of one column and a set of 2 or 3 rows of differing regularization parameters. The markers “∗”, “∗∗” and “∗∗∗” denote three p-value thresholds for the comparison with standard L2-Reg.
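The learning rate schedule described above can be written down as a short sketch. It assumes that "decaying by 0.3" means multiplying the current rate by 0.3 and that epochs are counted from 1; the function name is hypothetical.

```python
DECAY_EPOCHS = (50, 85, 110, 120, 127)


def learning_rate(epoch, initial=0.1, factor=0.3):
    """Learning rate used during `epoch` (1-based). The rate is multiplied
    by `factor` after each epoch listed in DECAY_EPOCHS (assumed reading)."""
    decays = sum(1 for boundary in DECAY_EPOCHS if epoch > boundary)
    return initial * factor ** decays


for epoch in (1, 50, 51, 86, 111, 121, 128, 130):
    print(epoch, learning_rate(epoch))
```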
V-B E1: (Anti-)Locality-Promoting Regularization
Under the assumption of locality, regularizing the outermost weights of 3x3 spatial filters more than the direct neighbors of the center should yield better results than the opposite. Outcomes for various parameter pairs in the upper part of Table III indeed support this hypothesis: regularizing the outermost weights more resulted in better accuracies for all tests, often with significant differences. Note that in each column of the upper part the center weight is regularized equally, eg. the regularization of the center is the same for a pair of parameters and for the same pair in swapped order. We also investigated regularizing the center weight more than weights further from it, which can be said to be “anti”-locality promoting. The results in the lower part of Table III indicate that regularizing centers more leads to worse outcomes.
V-C E2: LOCO-Reg on Multiple Architectures and Datasets
We used CIFAR-10, CIFAR-100 [10] and FASHION-MNIST [11], scaled to 32x32. We assess three network types: VGG and MobileNet variants (Table II) as well as ResNet-10 [12]. They cover different design ideas, such as stacking many convolutional layers (VGG), using shortcuts (ResNet) and separating spatial and depth-wise convolutions (MobileNet).
We chose overall regularization strengths of 0.00025, 0.0005, 0.001 and 0.002. (They were chosen after five runs of standard regularization from a larger set of parameters, so that the best outcome was attained neither for the minimum nor for the maximum strength, ie. neither for 0.00025 nor for 0.002.) For the LOCO-Reg parameters we used (1,1), (1.4,1.56) and (1.8,2.13), covering standard regularization as well as two parameter settings for LOCO-Reg.
Table IV shows results. The last column highlights that, for the best regularization strength, LOCO-Reg outperforms on average standard regularization across all datasets and architectures. According to the last column of Table IV, gains range from 0.05 to 0.72 percentage points. When looking at the outcomes for each individual regularization strength, LOCO-Reg outperforms standard regularization for 34 out of 36 settings (see bold values in Table IV). Differences are often significant despite conducting only 15 runs.
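The gains in the last column of Table IV can be recomputed with a few lines. The accuracies below are copied from that column; the data layout and the function name are made up for this sketch, and taking the better of the two LOCO-Reg settings per row group is an assumption about how the comparison is meant.

```python
# Best accuracy per dataset and architecture from the last column of Table IV:
# (standard regularization, LOCO-Reg (1.4,1.56), LOCO-Reg (1.8,2.13)).
BEST_ACCURACY = {
    ("cifar10", "MobileNet"): (0.8688, 0.8714, 0.8721),
    ("cifar10", "ResNet"): (0.9236, 0.9253, 0.9244),
    ("cifar10", "VGG"): (0.8858, 0.8869, 0.8884),
    ("cifar100", "MobileNet"): (0.6182, 0.6182, 0.6199),
    ("cifar100", "ResNet"): (0.7156, 0.7163, 0.7198),
    ("cifar100", "VGG"): (0.6599, 0.6665, 0.6671),
    ("fashion", "MobileNet"): (0.9403, 0.9406, 0.9408),
    ("fashion", "ResNet"): (0.9504, 0.9510, 0.9515),
    ("fashion", "VGG"): (0.9426, 0.9436, 0.9437),
}


def best_gain(standard, *loco_settings):
    """Gain of the better LOCO-Reg setting over standard regularization."""
    return max(loco_settings) - standard


gains = {key: best_gain(*values) for key, values in BEST_ACCURACY.items()}
for key, gain in sorted(gains.items(), key=lambda item: item[1]):
    print(key, f"{100 * gain:.2f} percentage points")
```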
V-D E3: 5x5 and 7x7 spatial filters
We used VGG10 variants based on Table II for CIFAR-100 with convolutions larger than 3x3, ie. we utilized 5x5 as well as 7x7 convolutions. Specifically, we replaced each 3x3 convolution with 5x5 (as well as 7x7) if the convolution did not exceed the feature map’s width and height. That is, for VGG5×5 we replaced the first five layers of 3x3 convolutions with 5x5. In VGG7×7 3x3 filters in the first 3 layers were altered to 7x7 filters and in the 4th and 5th layer we changed to 5x5 filters. We have defined LOCO-Reg using two parameters (Equation 1). To generalize, we use a simple linear function that gives the regularization parameter given the distance from the center, ie.
[Equation 5]
Outcomes are shown in Table V. LOCO-Reg improves accuracy by about 0.8 percentage points for both filter sizes, and the gains are highly significant. Since there are more than 2 distances from the center, specifying two parameters is not sufficient to achieve a gradual regularization from the center to the corner. Table V contains the regularization factors sorted by distance.
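A linear dependence of the regularization factor on the distance from the center can be sketched as follows. The exact form of Equation 5 is not reproduced here: the sketch assumes a factor of the form intercept plus slope times Euclidean distance, reads a pair such as (1, 0.25) from Table V as (intercept, slope), and omits any normalization. All names are hypothetical.

```python
import math


def linear_factors(size, intercept, slope):
    """Regularization factor per location of a size x size spatial filter:
    intercept + slope * Euclidean distance to the center (assumed form)."""
    center = (size - 1) / 2
    return [
        [intercept + slope * math.hypot(i - center, j - center) for j in range(size)]
        for i in range(size)
    ]


def distinct_distances(size):
    """Sorted distinct distances from the center that occur in the filter."""
    center = (size - 1) / 2
    return sorted({
        round(math.hypot(i - center, j - center), 9)
        for i in range(size) for j in range(size)
    })


factors_5x5 = linear_factors(5, 1.0, 0.25)
print(factors_5x5[2][2], factors_5x5[2][3], factors_5x5[0][0])
print(distinct_distances(3), distinct_distances(5))
```

A 3x3 filter only has two non-zero distances from the center, which is why two parameters suffice there, whereas a 5x5 filter has five.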
V-E E4: LOCO-Reg on Larger Images
We also assessed LOCO-Reg on larger images, namely 64x64 images from the TinyImageNet dataset (https://tiny-imagenet.herokuapp.com/) and images scaled to 128x128 from the DeepWeeds [13] dataset, without data augmentation. We expanded the VGG architecture in Table II by adding a convolutional layer with 32 filters and a max-pooling layer after the first max-pooling layer. Results in Table VI indicate that LOCO-Reg yields gains for both datasets.
V-F E5: STRIP-REG and ACNet
We first assess the impact of parameter . We use , where implies that we use three 3x3 filters that are all treated identically. We used the code provided by [5] and trained a Resnet variant “RCNet-10” designed for CIFAR-10. We used a predefined learning schedule from [5] with SGD training for 80 epochs.
The results in Table VII indicate that merely using 3x3 filters has no impact. Small amounts of additional regularization of “outside” weights yield no improvements. Large additional regularization yields significant improvements compared to the baseline, and scaling of initial values has only a marginal additional impact. P-values of the Wilcoxon rank-sum test comparing the baseline with STRIP-Reg under large regularization with default and scaled initialization are 0.05 and 0.04, respectively.
VI Related Work
Some priors for “what makes a good representation” have been briefly motivated by physics [3]. Locality, or the idea of emphasizing the center of a representation, has not been discussed. The closest weakly related idea is spatial coherence, which is said to imply slow changes of features across spatial dimensions. Locality implies that the relevance of feature weights decreases with distance from the center. Disentangling of features is also an important aspect in representation learning. Works on feature disentanglement [14, 15] often constrain representations, eg. weight matrices might be enforced to be orthogonal [15]. These works aim at reducing the number of similar representations. Our approach does not directly constrain representations to be dissimilar. Since interactions decrease with distance, larger spatial separation might imply more differences in inputs. Thus, one might hope to obtain more distinct feature representations compared to the ones that are learnt using patches of nearby or overlapping data.
The precise location of features was not deemed essential in the early stages of deep learning. Early works [16] on image recognition as well as more recent architectures such as Inception [17] or VGG [9] might use max-pooling, which neglects spatial information by simply extracting the maximum of a region without keeping any information on its position within that region. However, it has been recognized that operations such as convolutions [18] or fractional max-pooling [19], which maintain spatial information more accurately, are more suitable for down-sampling. Pooling has been seen as a mechanism to achieve (more) translation invariance. More elaborate approaches such as spatial transformer networks [20] allow learning multiple transformations such as shear, translation and scaling. There is a significant body of work that discusses invariance, e.g., [21, 22]. [21] showed how to exploit groups of symmetries such as rotations and reflections. These works aim at replacing the learning of (many) transformable representations by learning the transformation itself and (fewer) representations to which the transformation can be applied. In contrast, we are more concerned with defining individual representations well rather than reducing the number of representations.
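The loss of positional information under max-pooling can be seen in a small sketch. The helper `max_pool_2x2` and the two toy inputs below are illustrative assumptions, not code or data from the paper. Both inputs contain the same values at different positions inside each 2x2 window, yet they pool to the same output.

```python
def max_pool_2x2(image):
    """Non-overlapping 2x2 max-pooling on a list-of-lists image.

    Assumes the height and width are even.
    """
    pooled = []
    for r in range(0, len(image), 2):
        row = []
        for c in range(0, len(image[0]), 2):
            # Only the maximum survives; its position in the window is dropped.
            row.append(max(image[r][c], image[r][c + 1],
                           image[r + 1][c], image[r + 1][c + 1]))
        pooled.append(row)
    return pooled

# Two toy inputs whose peaks sit at different positions within each window.
a = [[9, 0, 0, 0],
     [0, 0, 0, 5],
     [0, 0, 0, 0],
     [0, 3, 0, 0]]
b = [[0, 0, 0, 0],
     [0, 9, 5, 0],
     [3, 0, 0, 0],
     [0, 0, 0, 0]]

print(max_pool_2x2(a))  # [[9, 5], [3, 0]]
print(max_pool_2x2(b))  # [[9, 5], [3, 0]]
```

Since both outputs are identical, later layers cannot recover where inside each window the maximum occurred.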
Our work is also loosely related to aspects of the human visual system. Lateral inhibition is an effect in which neurons suppress their less-active neighbors [16, 23]. For instance, it plays a key role in Mach bands, where it increases the contrast between different tones of gray, i.e., neurons perform high-pass filtering. Our work is neither limited to high-pass filtering nor do we strongly inhibit features. However, in our implementation we also train features by slightly reducing the impact of the outer parts of a feature, i.e., the sub-features of which it is composed, to compensate for the promotion of the center.
There are many regularization schemes, such as L1- and L2-regularization, elastic nets [24] or versions of dropout such as DropConnect [25]. Other works have also used regularization as a means to achieve desired properties of feature representations, e.g., regularization in [15] pushes pairs of kernels towards a small cosine metric, i.e., towards being orthogonal. In contrast, we focus on regularization of elements of a representation rather than pairs of complete representations.
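The difference between the two kinds of penalty can be sketched as follows. Both functions are hypothetical illustrations: `pairwise_cosine_penalty` assumes the sum of absolute cosine similarities as the pairwise term, which need not match the exact loss in [15], and `spatial_l2_penalty` assumes a penalty strength that grows geometrically with the Chebyshev distance from the filter center, which is not claimed to be the LOCO-Reg formula. The parameters `base` and `outer_scale` are likewise assumptions.

```python
import math

def cosine_similarity(k1, k2):
    """Cosine similarity between two equally sized 2D kernels (lists of lists)."""
    f1 = [w for row in k1 for w in row]
    f2 = [w for row in k2 for w in row]
    dot = sum(a * b for a, b in zip(f1, f2))
    norm = math.sqrt(sum(a * a for a in f1)) * math.sqrt(sum(b * b for b in f2))
    return dot / norm

def pairwise_cosine_penalty(kernels):
    """Penalty on PAIRS of complete kernels: sum of |cosine| over all pairs."""
    total = 0.0
    for i in range(len(kernels)):
        for j in range(i + 1, len(kernels)):
            total += abs(cosine_similarity(kernels[i], kernels[j]))
    return total

def spatial_l2_penalty(kernel, base=1.0, outer_scale=2.0):
    """Penalty on ELEMENTS of one square kernel.

    Each squared weight is scaled by base * outer_scale ** d, where d is the
    Chebyshev distance to the center (an assumed, illustrative scheme).
    """
    center = len(kernel) // 2
    total = 0.0
    for r, row in enumerate(kernel):
        for c, w in enumerate(row):
            d = max(abs(r - center), abs(c - center))
            total += base * (outer_scale ** d) * w * w
    return total

horizontal = [[1, 1, 1], [0, 0, 0], [-1, -1, -1]]
vertical = [[1, 0, -1], [1, 0, -1], [1, 0, -1]]
print(pairwise_cosine_penalty([horizontal, vertical]))  # orthogonal pair

center_heavy = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]
corner_heavy = [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
print(spatial_l2_penalty(center_heavy), spatial_l2_penalty(corner_heavy))
```

The two toy kernels `center_heavy` and `corner_heavy` have the same L2 norm, so uniform L2-regularization would treat them equally, whereas the position-dependent penalty charges the corner weight more.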
There are also loose connections to pruning of CNNs, e.g., [26, 27, 28]. However, in pruning the goal is typically to fully remove weights or entire filters, or to replace filters (e.g., 3x3 by 1x1), for computational efficiency with little performance loss, while we are primarily concerned with performance improvements.
VII Conclusions
Weights near the center of spatial filters are more important than weights closer to the boundary, meaning that more central weights of the filters are preferably larger on average, at least for lower layers. This statement is based on empirical findings that are also aligned with theoretical principles and goals, i.e., the Principle of Locality and the goal of obtaining cohesive features. The findings can be leveraged using non-uniform, spatial regularization, leading to improvements on multiple architectures.
- [1] L. Barenboim, M. Elkin, S. Pettie, and J. Schneider, “The locality of distributed symmetry breaking,” Journal of the ACM (JACM), 2016.
- [2] T. Mikolov, K. Chen, G. Corrado, and J. Dean, “Efficient estimation of word representations in vector space,” Int. Conference on Learning Representations (ICLR), 2013.
- [3] Y. Bengio, A. Courville, and P. Vincent, “Representation learning: A review and new perspectives,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, 2013.
- [4] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman, “Building machines that learn and think like people,” Behavioral and Brain Sciences, vol. 40, 2017.
- [5] X. Ding, Y. Guo, G. Ding, and J. Han, “ACNet: Strengthening the kernel skeletons for powerful CNN via asymmetric convolution blocks,” in Proc. of Int. Conference on Computer Vision (ICCV), 2019.
- [6] A. Hatamlou, S. Abdullah, and H. Nezamabadi-Pour, “Application of gravitational search algorithm on data clustering,” in International Conference on Rough Sets and Knowledge Technology, 2011.
- [7] C.-Y. Chen, S.-C. Hwang, and Y.-J. Oyang, “An incremental hierarchical data clustering algorithm based on gravity theory,” in Pacific-Asia Conf. on Knowledge Discovery and Data Mining, 2002.
- [8] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “MobileNets: Efficient convolutional neural networks for mobile vision applications,” arXiv preprint arXiv:1704.04861, 2017.
