Advancements in Image Classification using Convolutional Neural Network

Farhana Sultana; A. Sufian; Paramartha Dutta

arXiv:1905.03288·cs.CV·May 27, 2019


TL;DR

This paper reviews the evolution of CNN architectures for image classification, highlighting advancements from LeNet-5 to SENet, and compares their models and training details.

Contribution

It provides a comprehensive overview of CNN architecture developments and performance comparisons for image classification tasks.

Findings

01

SENet outperforms earlier CNN models in accuracy.

02

Advancements in CNN architectures improve image classification performance.

03

Detailed comparison of CNN models from LeNet-5 to SENet.

Abstract

Convolutional neural networks (CNNs) are the state of the art for image classification. We briefly discuss the components of a CNN and explain different CNN architectures for image classification, tracing the advancements from LeNet-5 to the recent SENet model. For each model we give a description and its training details, and we draw a comparison among the models.

Tables

Table I: Architecture of LeNet-5

Layer	Filter size/stride	# Filters	Output size	# Parameters
Convolution(C1)	$5 \times 5$ /1	6	$28 \times 28 \times 6$	156
Sub-sampling(S2)	$2 \times 2$ /2		$14 \times 14 \times 6$	12
Convolution(C3)	$5 \times 5$ /1	16	$10 \times 10 \times 16$	1516
Sub-sampling(S4)	$2 \times 2$ /2		$5 \times 5 \times 16$	32
Convolution(C5)	$5 \times 5$	120	$1 \times 1 \times 120$	48120
Fully Connected(F6)			84	10164
OUTPUT				84

Table II: Details of the different layers of AlexNet

Layer	Filter size/stride	Padding	# Filters	Output size	# Parameters
Conv-1	$11 \times 11$ /4	0	96	$55 \times 55 \times 96$	34848
pool-1	$3 \times 3$ /2			$27 \times 27 \times 96$
Conv-2	$5 \times 5$ /1	2	256	$27 \times 27 \times 256$	614400
pool-2	$3 \times 3$ /2			$13 \times 13 \times 256$
Conv-3	$3 \times 3$ /1	1	384	$13 \times 13 \times 384$	884736
Conv-4	$3 \times 3$ /1	1	384	$13 \times 13 \times 384$	1327104
Conv-5	$3 \times 3$ /1	1	256	$13 \times 13 \times 256$	884736
pool-3	$3 \times 3$ /2			$6 \times 6 \times 256$
FC6				$1 \times 1 \times 4096$	37748736
FC7				$1 \times 1 \times 4096$	16777216
FC8				$1 \times 1 \times 1000$	4096000

Table III: Comparative performance of different CNN configurations. The + indicates DenseNet with bottleneck layer and compression (10-crop testing result).

Name of The CNN	Dataset	Year	Type of CNN	#trained layer	Top-1(val)	Top-5(val)	Top-5(test)
AlexNet	ImageNet	2012	1 CNN	8	40.7%	18.2%
			5 CNN	-	38.1%	16.4%	16.4%
			1 CNN	-	39.0%	16.6%	-
			7 CNN	-	36.7%	15.4%	15.3%
ZFNet	ImageNet	2013	1 CNN	8	38.4 %	16.5%
			5 CNN - (a)	-	36.7 %	15.3%	15.3%
			1 CNN with layers 3, 4, 5: 512, 1024, 512 maps-(b)	-	37.5 %	16.0%	16.1%
			6 CNN, combination of (a) & (b)	-	36.0 %	14.7%	14.8%
VGGNet	ImageNet	2014	ensemble of 7 ConvNets (3-D,2-C & 2-E)	-	24.7%	7.5%	7.3%
			ConvNet- D( multi-crop & dense)	16	24.4 %	7.2%	-
			ConvNet-E (Multi-crop & dense )	19	24.4 %	7.1%	-
			ConvNet-E (Multi-crop & dense )	19	24.4 %	7.1%	7.0%
			Ensemble of multi-scale ConvNets D & E (multi-crop & dense)	-	23.7%	6.8%	6.8%
GoogLeNet	ImageNet	2014	1 CNN with 1 crop	22	-	-	10.07%
			1 CNN with 10 crops	-	-	-	9.15%
			1 CNN with 144 crops	-	-	-	7.89%
			7 CNN with 1 crop	-	-	-	8.09%
			7 CNN with 10 crops	-	-	-	7.62%
			7 CNN with 144 crops	-	-	-	6.67%
ResNet	ImageNet	2015	plain layer	18	27.94%	-
			ResNet-18	18	27.88%	-
			Plain layer	34	28.54%	10.02%
			ResNet-34 (zero-padding shortcuts), 10 crop testing -(a)	34	25.03%	7.76%
			ResNet-34 (projection shortcuts to increase dimension, others are identity shortcuts ), 10 crop testing-(b)	34	24.52%	7.46%
			ResNet-34 (all shortcuts are projection), 10 crop testing-(c)	34	24.19%	7.40%
			ResNet-50 (with bottleneck layer), 10 crop testing	50	22.85%	6.71%
			ResNet-101 (with bottleneck layer), 10 crop testing	101	21.75%	6.05%
			ResNet-152 (with bottleneck layer), 10 crop testing	152	21.43%	5.71%
			1 ResNet-34 (b)	34	21.84%	5.71%
			1 ResNet-34 (c)	34	21.53%	5.60%
			1 ResNet-50	50	20.74%	5.25%
			1 ResNet-101	101	19.87%	4.60%
			1 ResNet-152	152	19.38%	4.49%
			Ensemble of 6 models	-			3.57%
DenseNet	ImageNet	2016	DenseNet-121 +	121	23.61%	6.66%
			DenseNet-169 +	169	22.80%	5.92%
			DenseNet-201 +	201	22.58%	5.54%
			DenseNet-264 +	264	20.80%	5.29%
SENet	ImageNet	2017	SE-ResNet-50	50	23.29%	6.62%
			SE-ResNext-50	50	21.10%	5.49%
			SENet-154 (crop size $320 \times 320 / 299 \times 299$ )	-	17.28%	3.79%
			SENet-154(crop size $320 \times 320$ )	-	16.88%	3.58%

Taxonomy

Methods: LeNet · Sigmoid Activation · Average Pooling · Squeeze-and-Excitation Block · Global Average Pooling · Kaiming Initialization · Dense Connections · Max Pooling · Softmax

Full text

Advancements in Image Classification using Convolutional Neural Network

Farhana Sultana

Department of Computer Science

University of Gour Banga

West Bengal, India

Abu Sufian

Department of Computer Science

University of Gour Banga

West Bengal, India

Paramartha Dutta

Department of CSS

Visva-Bharati University

West Bengal, India

Keywords:

AlexNet, CapsNet, Convolutional Neural Network, Deep learning, DenseNet, Image classification, ResNet, SENet.

I Introduction

Computer vision comprises several problems, such as image classification, localization, segmentation and object detection. Among these, image classification can be considered the fundamental problem, and it forms the basis for the others. Until the 1990s, only traditional machine learning approaches were used to classify images, and the accuracy and scope of the task were limited by challenges such as hand-crafted feature extraction. In recent years, deep neural networks (DNNs), also referred to as deep learning [1][2], have been used to discover complex structure in large data sets using the backpropagation algorithm [3]. Among DNNs, the convolutional neural network has demonstrated excellent results in computer vision problems, especially image classification.

A convolutional neural network (CNN or ConvNet) is a special type of multi-layer neural network inspired by the visual system of living creatures. Hubel and Wiesel [4] discovered that cells in the animal visual cortex respond to light in small receptive fields. Motivated by this work, Kunihiko Fukushima introduced the neocognitron [5] in 1980, a multi-layered neural network capable of recognizing visual patterns hierarchically through learning. This network is considered the theoretical inspiration for the CNN. In 1990, LeCun et al. introduced a practical CNN model [6][7] and later developed LeNet-5 [8]. Training with the backpropagation algorithm [9] enabled LeNet-5 to recognize visual patterns directly from raw pixels, without a separate feature engineering stage. In addition, a CNN has fewer connections and parameters than a conventional feedforward network of similar size, which makes training easier. At that time, however, despite these advantages, the performance of CNNs on complex problems such as classifying high-resolution images was limited by the lack of large training sets, the lack of better regularization methods and inadequate computing power.

Nowadays we have larger datasets, such as ImageNet [10] and LabelMe [11], with millions of labelled high-resolution images in thousands of categories. With the advent of powerful GPUs and better regularization methods, CNNs deliver outstanding performance on image classification tasks. In 2012, a large deep convolutional neural network called AlexNet [12], designed by Krizhevsky et al., showed excellent performance in the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) [13]. In the following years, the success of AlexNet inspired other CNN models such as ZFNet [14], VGGNet [15], GoogLeNet [16], ResNet [17], DenseNet [18], CapsNet [19] and SENet [20].

In this study, we review the advancements of CNNs in image classification. Section II gives a general view of CNN architecture. Section III describes the architecture and training details of different CNN models. Section IV compares the models, and Section V concludes the paper.

II Convolutional Neural Network

A typical CNN is composed of one or more blocks of convolution and sub-sampling layers, followed by one or more fully connected layers and an output layer, as shown in figure 1.

II-A Convolutional Layer

The convolutional layer (conv layer) is the central part of a CNN. Images are generally stationary in nature: the statistics of one part of an image are the same as those of any other part, so a feature learnt in one region can match a similar pattern in another region. We therefore take a small window of weights, called a filter (kernel), and slide it over every position of the large input image. At each position, the filter and the image patch beneath it are convolved into a single output value. The filter weights are learnt by backpropagation. Figure 2 shows a typical convolution operation.
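The sliding-window operation described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `conv2d_valid`, the square 2x2 filter and the 4x4 input are hypothetical, no padding is used, and the filter is not flipped (the cross-correlation form that CNN implementations commonly call convolution).

```python
def conv2d_valid(image, kernel, stride=1):
    """Slide a square `kernel` over `image` (lists of lists), without padding."""
    k = len(kernel)
    out_h = (len(image) - k) // stride + 1
    out_w = (len(image[0]) - k) // stride + 1
    output = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Multiply the filter element-wise with the patch under it and sum.
            acc = 0
            for a in range(k):
                for b in range(k):
                    acc += image[i * stride + a][j * stride + b] * kernel[a][b]
            row.append(acc)
        output.append(row)
    return output


# Hypothetical 4x4 input and 2x2 filter, chosen only for illustration.
image = [
    [1, 2, 3, 0],
    [4, 5, 6, 1],
    [7, 8, 9, 2],
    [1, 0, 1, 3],
]
kernel = [[1, 0], [0, -1]]
feature_map = conv2d_valid(image, kernel)
print(feature_map)  # [[-4, -4, 2], [-4, -4, 4], [7, 7, 6]]
```

The same filter weights are reused at every position, which is why a feature learnt in one region can respond to the same pattern elsewhere.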

II-B Sub-sampling or Pooling Layer

Pooling simply means down-sampling of an image. It takes a small region of the convolutional output as input and sub-samples it to produce a single output. Several pooling techniques exist, such as max pooling and average (mean) pooling. Max pooling takes the largest value in a region, as shown in figure 3. Pooling reduces the number of values to be computed in later layers and makes the network more robust to small translations and changes in shape, size and scale.
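As a minimal sketch of max pooling, the snippet below reduces a hypothetical 4x4 feature map with a 2x2 window and a stride of 2. The function name `max_pool2d` and the input values are assumptions made for illustration.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Keep the largest value of each size x size region."""
    out_h = (len(feature_map) - size) // stride + 1
    out_w = (len(feature_map[0]) - size) // stride + 1
    return [
        [
            max(
                feature_map[i * stride + a][j * stride + b]
                for a in range(size)
                for b in range(size)
            )
            for j in range(out_w)
        ]
        for i in range(out_h)
    ]


# Hypothetical feature map, for illustration only.
fm = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 2, 8],
]
pooled = max_pool2d(fm)
print(pooled)  # [[6, 4], [7, 9]]
```

Replacing `max` with an average over the same region would give average pooling.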

II-C Fully-connected Layer (FC Layer)

The last part of a CNN consists of fully connected layers, as depicted in figure 4. Each neuron in such a layer takes input from all neurons in the previous layer and combines them to generate its output.

III Different Models of CNN for Image Classification

III-A LeNet-5(1998):

In 1998, LeCun et al. introduced a CNN to classify handwritten digits. Their model, called LeNet-5 [8] and shown in figure 5, has 7 weighted (trainable) layers: three convolutional layers (C1, C3, C5), two average pooling layers (S2, S4), one fully connected layer (F6) and one output layer. A sigmoid function was used to introduce non-linearity before each pooling operation. The output layer used Euclidean radial basis function (RBF) units [21] to classify the 10 digits.

Table I lists the layers of LeNet-5, the filter size used in each convolutional layer, the output feature map size and the number of parameters per layer.
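The output sizes and parameter counts in table I can be checked with a short calculation. The sketch below assumes the standard size formula (input - filter + 2 x padding) / stride + 1 and one bias per feature map. It also assumes, following the original LeNet-5 description rather than anything stated above, that F6 has 84 units, that each sub-sampling map has one coefficient and one bias, and that C3 is sparsely connected to S2 with 60 map-to-map connections. The helper names are hypothetical.

```python
def conv_output_size(in_size, filter_size, stride=1, padding=0):
    """Spatial size after a convolution or pooling window."""
    return (in_size - filter_size + 2 * padding) // stride + 1


def conv_params(filter_size, in_maps, out_maps):
    """Weights plus one bias per output map, for a fully connected conv layer."""
    return filter_size * filter_size * in_maps * out_maps + out_maps


# Spatial sizes, starting from the 32x32 input.
c1 = conv_output_size(32, 5)             # 28
s2 = conv_output_size(c1, 2, stride=2)   # 14
c3 = conv_output_size(s2, 5)             # 10
s4 = conv_output_size(c3, 2, stride=2)   # 5
c5 = conv_output_size(s4, 5)             # 1

# Parameter counts.
c1_params = conv_params(5, 1, 6)         # 156
s2_params = 6 * 2                        # 12: coefficient + bias per map
c3_params = 60 * 5 * 5 + 16              # 1516: assumed sparse connection table
s4_params = 16 * 2                       # 32
c5_params = conv_params(5, 16, 120)      # 48120
f6_params = 120 * 84 + 84                # 10164: assumes 84 units in F6

print(c1, s2, c3, s4, c5)
print(c1_params, s2_params, c3_params, s4_params, c5_params, f6_params)
```

With full connectivity between S2 and C3, `conv_params(5, 6, 16)` would give 2416 instead of 1516, so the table value only follows under the sparse-connection assumption.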

III-A1 Dataset used

To train and test LeNet-5, LeCun et al. used the MNIST [22] database of handwritten digits, which contains 60k training and 10k test images. The digits are size-normalized to at most $20\times 20$ pixels and centred in a $28\times 28$ field. The input to the model is $32\times 32$ pixels, which is larger than the largest character, so that distinctive features can appear in the centre of the receptive field of the highest-level feature detectors. The authors used data augmentation such as horizontal translation, vertical translation, scaling, squeezing and horizontal shearing.

III-A2 Training Details

The authors trained several versions of LeNet-5 using stochastic gradient descent (SGD) [23], with 20 passes over the entire training set per session, a decreasing global learning rate and a momentum of 0.02. For the 1990s, LeNet-5 was sufficiently good: LeNet-5 and LeNet-5 trained with distortions achieved test error rates of 0.95% and 0.8% respectively on MNIST.

But as the amount of data, the image resolution and the number of classes in classification problems increased over time, deeper convolutional networks and powerful GPUs were needed to train the models.

III-B AlexNet-2012:

In 2012, Krizhevsky et al. designed a large deep CNN called AlexNet [12] to classify ImageNet [10] data. The architecture of AlexNet is similar to that of LeNet-5 but much bigger. It has 8 trainable layers: 5 convolutional layers (conv layers) and 3 fully connected layers. Using the rectified linear unit (ReLU) [24] non-linearity after the convolutional and FC layers allowed the model to train faster than similar networks with $\tanh$ units. The authors applied local response normalization (LRN), which they called “brightness normalization”, after the first and second convolutional layers to aid generalization. A max-pooling layer follows each LRN layer and the fifth convolutional layer. Figure 6 shows the architecture of AlexNet, and table II lists the details of its layers.
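The output sizes in table II follow from the filter size, stride and padding of each layer. The sketch below chains them together. One assumption needs stating loudly: the $55\times 55$ output of Conv-1 is only reproduced with a $227\times 227$ input, whereas the text quotes $224\times 224$ crops, so the default `input_size=227` is an assumption and not a value taken from the table. The weight counts omit biases, which appears to be the convention the table uses.

```python
def output_size(in_size, filter_size, stride, padding=0):
    """Spatial size after a convolution or pooling window."""
    return (in_size - filter_size + 2 * padding) // stride + 1


# (name, filter size, stride, padding), following table II.
ALEXNET_LAYERS = [
    ("Conv-1", 11, 4, 0),
    ("pool-1", 3, 2, 0),
    ("Conv-2", 5, 1, 2),
    ("pool-2", 3, 2, 0),
    ("Conv-3", 3, 1, 1),
    ("Conv-4", 3, 1, 1),
    ("Conv-5", 3, 1, 1),
    ("pool-3", 3, 2, 0),
]


def alexnet_spatial_sizes(input_size=227):
    """Return the spatial output size of every layer (input size is assumed)."""
    sizes = {}
    size = input_size
    for name, filter_size, stride, padding in ALEXNET_LAYERS:
        size = output_size(size, filter_size, stride, padding)
        sizes[name] = size
    return sizes


def conv_weights(filter_size, in_channels, out_channels):
    """Number of weights in a conv layer, biases not counted."""
    return filter_size * filter_size * in_channels * out_channels


sizes = alexnet_spatial_sizes()
print(sizes)
print(conv_weights(11, 3, 96))    # 34848, the Conv-1 entry
print(conv_weights(5, 96, 256))   # 614400, the Conv-2 entry
print(6 * 6 * 256 * 4096)         # 37748736, the FC6 entry
```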

III-B1 Dataset used

Krizhevsky et al. designed AlexNet to classify the high-resolution images of ILSVRC-2010 and ILSVRC-2012 [25] into 1000 classes. The dataset has around 1.2 million/50k/150k training/validation/testing images. In ILSVRC, competitors report two kinds of error rates: top-1 and top-5. The top-5 error rate is the fraction of test images for which the correct label is not among the five labels the model considers most probable.
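A minimal sketch of how top-1 and top-k error rates can be computed from per-class scores is given below. The function name `top_k_error` and the toy scores (three images, four classes) are hypothetical; real ILSVRC evaluation uses 1000 classes and k = 5.

```python
def top_k_error(scores_per_image, true_labels, k):
    """Fraction of images whose true label is not among the k best-scored classes."""
    errors = 0
    for scores, label in zip(scores_per_image, true_labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if label not in ranked[:k]:
            errors += 1
    return errors / len(true_labels)


# Hypothetical scores for three images over four classes.
scores = [
    [0.1, 0.6, 0.2, 0.1],
    [0.5, 0.1, 0.3, 0.1],
    [0.2, 0.2, 0.5, 0.1],
]
labels = [1, 2, 3]

top1 = top_k_error(scores, labels, k=1)  # two of three images are wrong
top2 = top_k_error(scores, labels, k=2)  # one of three images is wrong
print(top1, top2)
```

The top-k error can never exceed the top-1 error, which is why the top-5 columns in the comparison table are always the smaller ones.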

III-B2 Training Details

ImageNet images have variable resolution, so AlexNet used images down-sampled and centre-cropped to $256\times 256$ pixels. To reduce overfitting, the authors used runtime data augmentation as well as a regularization method called dropout [26]. For data augmentation, they extracted 10 translated and horizontally reflected random patches of $224\times 224$ pixels, and also used principal component analysis (PCA) [27] to shift the RGB channels of training images. The authors trained AlexNet using stochastic gradient descent (SGD) with a batch size of 128, weight decay of 0.0005 and momentum of 0.9. The weight decay acts as a regularizer and also reduces the training error. The initial learning rate was 0.01, and it was manually divided by 10 three times, each time the validation accuracy plateaued. AlexNet was trained for five to six days on two NVIDIA GTX-580 3 GB GPUs using cross-GPU parallelization.
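The hyperparameters above can be put into a small update-rule sketch. It assumes the usual momentum form in which weight decay is folded into the velocity, v <- 0.9 v - 0.0005 lr w - lr g followed by w <- w + v. The function names and the toy weight and gradient values are hypothetical, and the gradient would in practice be averaged over a batch of 128 images.

```python
def sgd_step(weights, grads, velocity, lr=0.01, momentum=0.9, weight_decay=0.0005):
    """One SGD update with momentum and weight decay (assumed update form)."""
    new_velocity = [
        momentum * v - weight_decay * lr * w - lr * g
        for w, g, v in zip(weights, grads, velocity)
    ]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


def decayed_learning_rate(initial_lr, num_drops):
    """Learning rate after dividing by 10 `num_drops` times."""
    return initial_lr * (0.1 ** num_drops)


# Toy values, for illustration only.
weights, velocity = sgd_step(weights=[1.0], grads=[0.5], velocity=[0.0])
print(weights, velocity)              # roughly [0.994995] and [-0.005005]
print(decayed_learning_rate(0.01, 3))  # roughly 1e-05 after three manual drops
```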

The authors noticed that removing any middle layer degrades the network's performance, so the result depends on the depth of the network. They used a purely supervised learning approach to simplify their experiments, but they expected that unsupervised pre-training would help, given enough computational power to significantly increase the network size without a corresponding increase in the amount of labelled data.

III-C ZFNet

In 2014, Zeiler and Fergus presented a CNN called ZFNet [14]. The architectures of AlexNet and ZFNet are almost the same, except that the authors reduced the first-layer filter size from $11\times 11$ to $7\times 7$ and used a stride of 2 in the first and second convolutional layers to retain more information in those layers' features. In their paper, the authors tried to explain the reason behind the outstanding performance of large deep CNNs. They used a novel visualization technique based on a multi-layered deconvolutional network, called deconvnet [28], to map activations at higher layers back to the input pixel space and thus identify which input pixels are responsible for a given activation in a feature map. Essentially, a deconvnet is a convnet applied in reverse order. It takes a feature map as input and applies unpooling using switches, where a switch is the position of the maximum within a pooling region, recorded during the forward pass. The unpooled map is then rectified using the ReLU non-linearity and filtered with the transposed versions of the learnt filters to reconstruct the activity in the layer below that gave rise to the chosen activation.
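The role of the switches can be illustrated with a small sketch: max pooling records where each maximum came from, and unpooling puts each pooled value back at that position, leaving zeros elsewhere. This covers only the unpooling step, not the rectification or transposed filtering, and the function names and the 4x4 input are hypothetical.

```python
def max_pool_with_switches(feature_map, size=2):
    """Non-overlapping max pooling that also records the position of each maximum."""
    pooled, switches = [], []
    for i in range(0, len(feature_map) - size + 1, size):
        pooled_row, switch_row = [], []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            cells = [
                (feature_map[i + a][j + b], (i + a, j + b))
                for a in range(size)
                for b in range(size)
            ]
            value, position = max(cells, key=lambda cell: cell[0])
            pooled_row.append(value)
            switch_row.append(position)
        pooled.append(pooled_row)
        switches.append(switch_row)
    return pooled, switches


def unpool(pooled, switches, height, width):
    """Place each pooled value back at its recorded position; zeros elsewhere."""
    output = [[0] * width for _ in range(height)]
    for pooled_row, switch_row in zip(pooled, switches):
        for value, (r, c) in zip(pooled_row, switch_row):
            output[r][c] = value
    return output


fm = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 2, 8],
]
pooled, switches = max_pool_with_switches(fm)
restored = unpool(pooled, switches, height=4, width=4)
print(pooled)    # [[6, 4], [7, 9]]
print(restored)  # [[0, 0, 0, 4], [0, 6, 0, 0], [7, 0, 9, 0], [0, 0, 0, 0]]
```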

III-C1 Training Details

ZFNet used the ImageNet dataset of 1.3 million/50k/100k training/validation/testing images. The authors trained their model following [12]. One difference is that they replaced the sparse connections in layers 3, 4 and 5 of AlexNet with dense connections, and trained the model on a single GTX-580 GPU for 70 epochs, which took 12 days. They also experimented with different depths and filter sizes on the Caltech-101 [29], Caltech-256 [30] and PASCAL-2012 [31] data sets, and showed that their model generalizes well to these datasets.

During training, their visualization technique reveals several properties of CNNs. The projections from each layer, in ascending order, show that the features in the network are hierarchical. As a result, the upper layers need more epochs than the lower layers to converge, and the network output is stable to translation and scaling. The authors also used a set of occlusion experiments to check whether the model is sensitive to local or global information.

III-D VGGNet

Simonyan and Zisserman used a deeper configuration of AlexNet [12] and proposed it as VGGNet [15]. They used small $3\times 3$ filters in all layers and made the network deeper while keeping the other parameters fixed. In total they evaluated 6 CNN configurations, A, A-LRN, B, C, D (VGG16) and E (VGG19), with 11, 11, 13, 16, 16 and 19 weighted layers respectively. Figure 8 shows the configuration of model D.

In model C, the authors used $1\times 1$ filters in three convolution layers (the sixth, ninth and twelfth) to increase non-linearity. Also, a stack of three $3\times 3$ convolution layers (with stride 1) has the same effective receptive field as one $7\times 7$ convolution layer. They therefore replaced a single $7\times 7$ layer with a stack of three $3\times 3$ layers, a change that increases non-linearity and decreases the number of parameters in the network.
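Both claims about the $3\times 3$ stack can be checked with simple arithmetic. The sketch below assumes stride 1 and the same number of channels C in and out of every layer. The value C = 256 is a hypothetical choice, and biases are ignored.

```python
def stacked_receptive_field(num_layers, filter_size=3):
    """Effective receptive field of stride-1 conv layers stacked on one another."""
    # Each additional layer widens the field by (filter_size - 1) pixels.
    return 1 + num_layers * (filter_size - 1)


def conv_weights(filter_size, channels):
    """Weights of one conv layer with `channels` inputs and outputs, no biases."""
    return filter_size * filter_size * channels * channels


C = 256  # hypothetical channel count
three_3x3 = 3 * conv_weights(3, C)   # 27 * C^2
one_7x7 = conv_weights(7, C)         # 49 * C^2

print(stacked_receptive_field(3))    # 7, the same as a single 7x7 layer
print(three_3x3, one_7x7)            # 1769472 3211264
```

The stack needs 27 C^2 weights against 49 C^2 for the single layer, roughly 45% fewer, while applying three non-linearities instead of one.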

III-D1 Training Details

The training procedure of VGGNet follows that of AlexNet, except for the cropping and scaling of input images for training and testing. Pre-initialization of certain layers and the use of small filters helped the model converge after 74 epochs, despite its large number of parameters and greater depth. The authors trained configuration A with random initialisation. Then, using its first 4 convolution layers and last 3 FC layers as pre-initialised layers, they gradually increased the number of weighted layers up to 19 and trained configurations A-LRN to E. They randomly cropped $224\times 224$ patches from isotropically rescaled training images, and used horizontal flipping, random RGB colour shifting and scale jittering for data augmentation. Scale jittering in the train/test phases and the combination of cropped (multi-crop) and uncropped (dense) test evaluation result in better accuracy.

The authors found that a deep network with small filters performs better than a shallower one with larger filters, so the depth of the network is important for visual representation.

III-E GoogLeNet

The architecture of GoogLeNet [16], proposed by Szegedy et al., differs from conventional CNNs. The authors increased the number of units in each layer using the inception module [32], which applies parallel filters of size $1\times 1$, $3\times 3$ and $5\times 5$ in each convolution layer (conv layer). They also increased the depth to 22 layers. Figure 10 shows the 22-layer GoogLeNet. The model was designed under a fixed computational budget so that it can be used in mobile and embedded systems. The inception architecture uses a series of weighted Gabor filters [33] of various sizes to handle multiple scales. To make the architecture computationally efficient, the authors used the inception module with dimensionality reduction instead of the naive version. Figures 9(a) and 9(b) show both inception modules. Despite its 22 layers, GoogLeNet uses 12 times fewer parameters than AlexNet, yet its accuracy is significantly better. All the convolution, reduction and projection layers use the ReLU non-linearity. An average pooling layer is used instead of the fully connected layers. On top of some inception modules, the authors added auxiliary classifiers, which are smaller CNNs, to combat the vanishing gradient problem and overfitting.

III-E1 Training Details

GoogLeNet, a CPU-based implementation, was trained with the DistBelief [34] distributed machine learning system using a modest amount of model and data parallelism. The authors used asynchronous SGD with momentum 0.9 and a fixed learning rate schedule. Using different sampling and random ordering of input images, they trained an ensemble of 7 GoogLeNets with the same initialization. Unlike AlexNet, they resized each image to 4 scales, with the shorter dimension being 256, 288, 320 and 352 respectively. The total number of crops per image is 4 (scales) $\times 3$ (left, centre and right squares per scale) $\times 6$ (the 4 corner and the centre $224\times 224$ crops, plus the square resized to $224\times 224$) $\times 2$ (each of the six crops and its mirror image) $= 144$.

The results of the inception architecture show that moving towards sparser architectures is a feasible and useful idea.
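The crop count can be verified by enumerating every combination. The sketch below only counts the combinations and does not crop any image. The label strings are hypothetical names for the options listed in the text.

```python
from itertools import product

scales = [256, 288, 320, 352]                 # shorter side after resizing
squares = ["left", "centre", "right"]         # three squares per scale
crops = [
    "top-left", "top-right", "bottom-left", "bottom-right",
    "centre", "resized-square",               # 4 corners, centre, whole square
]
mirrored = [False, True]                      # original and mirror image

all_crops = list(product(scales, squares, crops, mirrored))
print(len(all_crops))  # 144
print(all_crops[0])    # (256, 'left', 'top-left', False)
```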

III-F ResNet

He et al. observed that a deeper CNN built by stacking more layers suffers from the vanishing gradient problem. Although this problem is largely addressed by normalized initialization and intermediate normalization layers, the deeper model still shows higher training and test errors, and this degradation is not caused by overfitting. This indicates that deeper networks are harder to optimize. In principle, a deeper model can be constructed from a trained shallower model by adding layers that perform identity mapping, so the deeper network should perform no worse than the shallower one. The authors proposed the deep residual learning framework [17] as a solution to this degradation problem. Instead of letting the stacked layers fit the desired underlying mapping $H(x)$ directly, they let them fit the residual mapping $F(x)=H(x)-x$, so that the block outputs $H(x)=F(x)+x$. They named their model ResNet [17].
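The relation $H(x)=F(x)+x$ can be sketched with plain lists standing in for feature vectors. The residual functions below are hypothetical placeholders for the stacked conv layers. The point of the sketch is that when the learnt residual is zero the block reduces to an identity mapping, so adding such blocks should not make a network worse.

```python
def residual_block(x, residual_fn):
    """Return H(x) = F(x) + x, where F is computed by `residual_fn`."""
    fx = residual_fn(x)
    return [f + xi for f, xi in zip(fx, x)]


def zero_residual(x):
    """A block that has learnt F(x) = 0."""
    return [0.0 for _ in x]


def small_residual(x):
    """Hypothetical stand-in for a learnt residual."""
    return [0.1 * xi for xi in x]


x = [1.0, -2.0, 3.0]
print(residual_block(x, zero_residual))   # [1.0, -2.0, 3.0], the identity
print(residual_block(x, small_residual))  # roughly [1.1, -2.2, 3.3]
```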

The ResNet architecture consists of stacked residual blocks of $3\times 3$ convolutional layers. The number of filters is periodically doubled, using a stride of 2. Figures 11(a) and 11(b) show a plain layer and a residual block. The first layer is a $7\times 7$ conv layer, and no fully connected layers are used at the end apart from the final classification layer. The authors used ResNets of different depths (34, 50, 101 and 152) in the ILSVRC-2015 competition. For networks with a depth of 50 or more, they used a ’bottleneck’ layer for dimensionality reduction and to improve efficiency, as in GoogLeNet. The bottleneck design consists of $1\times 1$, $3\times 3$ and $1\times 1$ convolution layers. Although the 152-layer ResNet is 8 times deeper than VGG nets, it has lower complexity than VGG nets (16/19).

III-F1 Training Details

To train ResNet, He et al. used SGD with a batch size of 128, weight decay of 0.0001 and momentum of 0.9. The learning rate started at 0.1 and was divided by 10 twice, at 32k and 48k iterations, when the validation accuracy plateaued, and training stopped at 64k iterations. They used weight initialization and batch normalization after every conv layer. They did not use dropout regularization.

The ResNet experiments show that deeper networks can be trained without degrading performance. The authors also showed that ResNets are easier to optimize and gain accuracy as depth increases.

III-G DenseNet

Huang et al. introduced Dense Convolutional Networks (DenseNet) [18], which add dense blocks to a conventional CNN. The input to a layer in a dense block is the concatenation of the outputs of all the preceding layers, as shown in figure 12. Each layer thus reuses the features of all previous layers, which strengthens feature propagation and reduces the vanishing gradient problem. Using a small number of filters per layer also reduces the number of parameters.

Figure 13 shows a DenseNet with three dense blocks. In a dense block, the non-linear transformation is a composite function of batch normalization, ReLU and a $3\times 3$ convolution. The authors also used a $1\times 1$ bottleneck layer to reduce dimensionality.
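Because every layer receives the concatenation of all earlier outputs, the number of input channels grows linearly through a dense block. The sketch below counts channels only. The name `growth_rate` for the number of filters each layer adds, and the values 64, 32 and 6, are assumptions chosen for illustration.

```python
def dense_block_channels(initial_channels, growth_rate, num_layers):
    """Return each layer's input channel count and the block's output channel count."""
    inputs = []
    channels = initial_channels
    for _ in range(num_layers):
        inputs.append(channels)   # the layer sees everything produced so far
        channels += growth_rate   # its own output is concatenated on
    return inputs, channels


layer_inputs, block_output = dense_block_channels(
    initial_channels=64, growth_rate=32, num_layers=6
)
print(layer_inputs)  # [64, 96, 128, 160, 192, 224]
print(block_output)  # 256
```

Each layer contributes only `growth_rate` new feature maps, which is how a dense block can stay small in parameters despite its many connections.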

III-G1 Training Details

Huang et al. trained DenseNet on the CIFAR [35], SVHN [36] and ImageNet datasets using SGD, with a batch size of 64 on CIFAR and SVHN and 256 on ImageNet. The initial learning rate was 0.1 and was divided by 10 twice during training. They used weight decay of 0.0001, Nesterov momentum [37] of 0.9 and dropout of 0.2.

On the C10 [38], C100 [39] and SVHN datasets, DenseNet and DenseNet-BC achieve lower error rates than previous CNN architectures. A DenseNet twice as deep as a ResNet gives similar accuracy on ImageNet with about half the number of parameters. The authors found that DenseNet can be scaled to hundreds of layers without optimization difficulty. Its accuracy improves consistently as the number of parameters increases, without signs of performance degradation or overfitting. It also requires comparatively fewer parameters and less computation to reach better performance.

III-H CapsNet

The conventional CNNs described above suffer from two problems. Firstly, sub-sampling loses the spatial relationships between higher-level features. Secondly, they have difficulty generalizing to novel viewpoints: they can deal with translation but cannot handle other kinds of affine transformation. In 2017, Geoffrey E. Hinton and co-authors proposed CapsNet [19] to handle these problems. CapsNet is built from components called capsules. A capsule is a group of neurons, so a layer of CapsNet is composed of nested groups of neurons. Unlike in a typical neural network, where each output unit is squashed individually, a capsule is squashed as a whole vector. The scalar-output feature detectors of a CNN are thus replaced by vector-output capsules. Max-pooling is also replaced by “dynamic routing by agreement”, which sends the output of each capsule to the most relevant capsules in the next layer during forward propagation.

Architecture of a simple CapsNet is shown in figure 14.

The CapsNet proposed by Sabour et al. is composed of three layers: two conv layers and one FC layer. The first conv layer consists of 256 convolutional units (CU) with $9\times 9$ kernels and a stride of 1, and uses ReLU as the activation function. This layer detects local features and sends them as input to the primary capsules of the second layer. Each primary capsule contains 8 CUs with a $9\times 9$ kernel and a stride of 2. In total, the primary capsule layer has $32\times 6\times 6$ 8D capsules. The final layer (DigitCaps) has one 16D capsule per digit class. Routing is used between the primary capsule layer and the DigitCaps layer. As the output of the first convolutional layer is 1D, no routing is used between it and the primary capsule layer.
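Two details of this architecture can be sketched numerically. First, the $6\times 6$ grid of primary capsules follows from the kernel sizes and strides, assuming a $28\times 28$ MNIST input and no padding. Second, whole-vector squashing is assumed here to take the form v = (|s|^2 / (1 + |s|^2)) (s / |s|), which is the non-linearity of the capsule paper [19] and is not spelled out in the text above. The function names are hypothetical.

```python
import math


def output_size(in_size, filter_size, stride):
    """Spatial size after an unpadded convolution."""
    return (in_size - filter_size) // stride + 1


def squash(s):
    """Shrink a capsule's vector to a length in [0, 1) while keeping its direction."""
    squared_norm = sum(v * v for v in s)
    if squared_norm == 0:
        return [0.0 for _ in s]
    scale = squared_norm / (1.0 + squared_norm) / math.sqrt(squared_norm)
    return [scale * v for v in s]


conv1 = output_size(28, 9, stride=1)       # 20
primary = output_size(conv1, 9, stride=2)  # 6
num_primary_capsules = 32 * primary * primary

v = squash([3.0, 4.0])
length = math.sqrt(sum(x * x for x in v))

print(conv1, primary, num_primary_capsules)  # 20 6 1152
print(length)                                # about 0.9615, i.e. 25/26
```

Long vectors are squashed to a length just below 1 and short ones towards 0, so the length of a capsule's output can be read as the probability that the entity it represents is present.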

III-H1 Training details

CapsNet was trained on MNIST images. To compare test accuracy, the authors used one standard CNN (baseline) and CapsNets with 1 and 3 routing iterations. They used a reconstruction loss as a regularization method. Using a 3-layer CapsNet with 3 routing iterations and the added reconstruction loss, they obtained a test error of 0.25%.

Though CapsNet has shown outstanding performance on MNIST, it may not perform well on large-scale image datasets like ImageNet. It may also suffer from the vanishing gradient problem.

III-I SENet

In 2017, Hu et al. designed the “Squeeze-and-Excitation network” (SENet) [20] and won ILSVRC-2017, reducing the top-5 error rate to 2.25%. Their main contribution is the “Squeeze-and-Excitation” (SE) block shown in figure 15. Here, $F_{tr}: X \to U$ is a convolutional operation. A squeeze function ($F_{sq}$) performs global average pooling on each channel of the feature map $U$ and produces a $1\times 1\times C$ channel descriptor. An excitation function ($F_{ex}$) is a self-gating mechanism made up of two fully connected layers with a ReLU non-linearity in between. It takes the squeezed output as input and produces per-channel modulation weights. Finally, $U$ is scaled by these weights ($F_{scale}$) to generate the final output ($\widetilde{X}$) of the SE block.

SE blocks can be stacked together to make an SENet, which generalises well across different data sets. The authors built several SENets by including these blocks in existing CNN models such as VGGNet [15], GoogLeNet [16], ResNeXt (a variant of ResNet) [40], Inception-ResNet [41], MobileNet [42] and ShuffleNet [43].
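The three steps of the SE block (squeeze, excitation, scale) can be sketched on a tiny feature map. This is a simplified illustration under several assumptions: biases are omitted, the second fully connected layer is followed by a sigmoid gate as in the SENet paper [20] (the text above mentions only the ReLU), and the weight matrices `W1` and `W2` and the two-channel input are hypothetical values.

```python
import math


def sigmoid(value):
    return 1.0 / (1.0 + math.exp(-value))


def se_block(U, W1, W2):
    """Apply squeeze, excitation and scale to U, a list of C channels (H x W each)."""
    # Squeeze: global average pooling gives one descriptor per channel.
    z = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in U]
    # Excitation: FC -> ReLU -> FC -> sigmoid (sigmoid gate is an assumption).
    hidden = [max(0.0, sum(w * zi for w, zi in zip(row, z))) for row in W1]
    s = [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in W2]
    # Scale: multiply every value of channel c by its weight s[c].
    scaled = [[[s_c * v for v in row] for row in ch] for s_c, ch in zip(s, U)]
    return scaled, s


# Hypothetical input with C = 2 channels of size 2 x 2.
U = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
W1 = [[1.0, -1.0]]     # C -> 1 hidden unit (dimensionality reduction)
W2 = [[2.0], [-2.0]]   # 1 hidden unit -> C channel weights

X_tilde, channel_weights = se_block(U, W1, W2)
print(channel_weights)  # about [0.731, 0.269]
print(X_tilde[0][1][1])  # 4.0 scaled by the first channel weight
```

In this toy case the first channel is emphasised and the second is suppressed, which is the per-channel recalibration the block is designed to perform.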

III-I1 Training Details

The authors trained and tested their model variants on ImageNet, CIFAR-10 and CIFAR-100. They trained the original CNN models and the same models with SE blocks, and compared the speed/accuracy trade-off. They showed that their models outperform the original models at the cost of a small increase in training/testing time.

IV Comparative Result

Table III shows the comparative performance of different CNNs (AlexNet to SENet) on the ImageNet dataset, giving the top-1 and top-5 error rates on the validation set and the top-5 error rate on the test set.

V Conclusion

In this study, we have discussed the advancements of CNNs in image classification. We have shown that although AlexNet, ZFNet and VGGNet follow the architecture of a conventional CNN such as LeNet-5, their networks are larger and deeper. By combining inception modules and residual blocks with the conventional CNN model, GoogLeNet and ResNet gained better accuracy than simply stacking the same building blocks again and again. DenseNet focused on feature reuse to strengthen feature propagation. Though CapsNet reached state-of-the-art performance on MNIST, it has yet to match previous CNNs on high-resolution image datasets such as ImageNet. The results of SENet on ImageNet give us hope that it may prove useful for other tasks that require strong discriminative features.

Bibliography

[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, no. 7553, pp. 436–444, May 2015.
[2] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016. [Online]. Available: http://www.deeplearningbook.org
[3] R. Hecht-Nielsen, “Theory of the backpropagation neural network,” in International 1989 Joint Conference on Neural Networks, 1989, pp. 593–605, vol. 1.
[4] D. H. Hubel and T. N. Wiesel, “Receptive fields and functional architecture of monkey striate cortex,” Journal of Physiology (London), vol. 195, pp. 215–243, 1968.
[5] K. Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,” Biological Cybernetics, vol. 36, no. 4, pp. 193–202, Apr 1980. [Online]. Available: https://doi.org/10.1007/BF00344251
[6] Y. Le Cun, B. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. Hubbard, and L. D. Jackel, “Backpropagation applied to handwritten zip code recognition,” Neural Computation, vol. 1, no. 4, pp. 541–551, Dec 1989.
[7] Y. Le Cun, B. E. Boser, J. S. Denker, D. Henderson, R. E. Howard, W. E. Hubbard, and L. D. Jackel, “Handwritten digit recognition with a back-propagation network,” in Advances in Neural Information Processing Systems 2, D. S. Touretzky, Ed. Morgan-Kaufmann, 1990, pp. 396–404. [Online]. Available: http://papers.nips.cc/paper/293-handwritten-digit-recognition-with-a-back-propagation-network.pdf
[8] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, Nov 1998.