Handwritten Indic Character Recognition using Capsule Networks
Bodhisatwa Mandal, Suvam Dubey, Swarnendu Ghosh, Ritesh Sarkhel, Nibaran Das

TL;DR
This paper shows that capsule networks outperform LeNet on handwritten Indic digit and character recognition, and that combining a capsule network with a CNN such as LeNet or AlexNet improves on that CNN's accuracy. AlexNet alone still outperforms the capsule network alone.
Contribution
The study applies capsule networks to handwritten Indic digits and characters, showing that they outperform LeNet and that they boost existing CNN architectures when combined with them.
Findings
Capsule networks outperform LeNet in Indic character recognition.
Capsule networks can enhance the performance of CNNs like LeNet and AlexNet.
Capsule networks address the position invariance of CNN kernels by routing information according to agreement between capsules.
Table 1: Architecture of the capsule network. NA: not applicable.
| Layer | Kernel size | Stride | Padding | Input channels / capsules | Output channels / capsules | Input capsule dimension | Output capsule dimension | Parameters trained by back-propagation | Parameters tuned by routing |
|---|---|---|---|---|---|---|---|---|---|
| Convolution | 9 | 1 | 0 | 1 | 256 | NA | NA | 20,992 | NA |
| Primary Caps | 9 | 2 | 0 | 256 | 32 | 1 | 8 | 5,308,672 | 0 |
| Digit Caps | NA | NA | NA | 1,152 | 10 | 8 | 16 | 1,474,560 | 11,520 |
| Decoder FC | NA | NA | NA | 160 | 512 | NA | NA | 82,432 | NA |
| Decoder FC | NA | NA | NA | 512 | 1,024 | NA | NA | 525,312 | NA |
| Decoder FC | NA | NA | NA | 1,024 | 784 | NA | NA | 803,600 | NA |

Total number of parameters: 8,227,088
Table 2: Accuracy (%) of each architecture and ensemble on the five datasets. The last column gives the mean across datasets with the standard deviation in parentheses.
| Architectures | Bangla Digits | Devanagari Digits | Telugu Digits | Bangla Basic Characters | Bangla Compound Characters | Mean (Std. dev.) |
|---|---|---|---|---|---|---|
| LeNet | 94.6 | 92.1 | 95.8 | 63.0 | 75.4 | 84.18 (14.42) |
| AlexNet | 97.65 | 96.3 | 97.4 | 95.9 | 94.2 | 96.29 (1.39) |
| CapsNet | 97.35 | 94.8 | 96.2 | 90.6 | 79.3 | 91.65 (7.36) |
| LeNet+AlexNet | 97.6 | 96.2 | 97.3 | 95.8 | 92.6 | 95.9 (1.99) |
| LeNet+CapsNet | 95.45 | 94 | 96.2 | 88.4 | 79.9 | 90.79 (6.81) |
| AlexNet+CapsNet | 97.75 | 96.6 | 97.6 | 96.2 | 94.4 | 96.51 (1.35) |
| All_Combined | 97.5 | 96.5 | 97.8 | 96.1 | 75.2 | 92.62 (9.76) |
Table 3: Comparison of the best result on each dataset with other approaches (accuracy in %).
| Dataset | Our approach | Accuracy | Other approaches | Accuracy |
|---|---|---|---|---|
| Bangla Digits | AlexNet + CapsNet | 97.75 | Basu et al. [6] | 96.67 |
| | | | Roy et al. [7] | 95.08 |
| | | | Roy et al. [8] | 97.45 |
| Devanagari Digits | AlexNet + CapsNet | 96.60 | Das et al. [9] | 90.44 |
| | | | Roy et al. [8] | 96.50 |
| Telugu Digits | AlexNet + CapsNet + LeNet | 97.80 | Sarkhel et al. [10] | 97.50 |
| | | | Roy et al. [8] | 87.20 |
| Bangla Basic Characters | AlexNet + CapsNet | 96.20 | Sarkhel et al. [10] | 86.53 |
| | | | Bhattacharya et al. [11] | 92.15 |
| Bangla Compound Characters | AlexNet + CapsNet | 94.40 | Roy et al. [12] | 90.33 |
| | | | Pal et al. [13] | 93.12 |
| | | | Sarkhel et al. [14] | 86.64 |
Bodhisatwa Mandal1, Suvam Dubey1, Swarnendu Ghosh1, Ritesh Sarkhel2, Nibaran Das1
1Dept. of CSE, Jadavpur University, Kolkata, 700032, WB, India.
{bodhisatwam,suvamdubey}@gmail.com, swarnendughosh.cse.rs,[email protected]
2Dept. of CSE, Ohio State University, Columbus, OH 43210, USA
Abstract
Convolutional neural networks (CNNs) have become one of the primary algorithms for various computer vision tasks. Handwritten character recognition is a typical example of such a task and has also attracted attention. CNN architectures such as LeNet and AlexNet have become very prominent over the last two decades; however, the spatial invariance of the different kernels has remained a prominent issue. With the introduction of capsule networks, kernels can work together in consensus with one another with the help of dynamic routing, which combines the individual opinions of multiple groups of kernels, called capsules, to achieve equivariance among kernels. In the current work, we have implemented a capsule network on handwritten Indic digit and character datasets to show its superiority over networks like LeNet. Furthermore, it is also shown that capsule networks can boost the performance of other networks like LeNet and AlexNet.
Index Terms:
Capsule Network, Convolutional Neural Network, Image Classification, Classifier Combination, Deep Learning
I Introduction
It has been two decades since the first convolutional neural network was introduced in 1998 [1] for the handwritten digit classification problem. Since then, computer vision has matured considerably in terms of both the complexity of the architectures and the difficulty of the challenges they address. Many works were introduced in later years to address challenges like object recognition [2]. However, through all these years one principal issue remained unaddressed. Convolutional neural networks are by their nature invariant to the spatial position of features. As the kernels that represent specific features are convolved over the entire image, the amount of activation is position invariant. The activations of different kernels do not communicate with each other, and hence their outputs are spatially invariant. Humans, by contrast, have intuitively developed the skill to analyze the relative positions of the various parts of an object. To learn these relations, capsule networks were proposed [3]. In short, a capsule network consists of groups of kernels that work together and pass information to the next layers through a mutual agreement that is achieved by dynamic routing of information during the forward pass. In our experiments, our goal is to analyze the performance of these networks on some Indic digit and character datasets. We have used three of the most popular Indic digit datasets, namely Devanagari digits, Bangla digits, and Telugu digits. While these have only 10 classes, the two character datasets, namely Bangla basic characters and Bangla compound characters, have 50 and 199 classes respectively. There have been many works on Indic datasets using CNNs before [4, 5]. In our experiments, the performance of capsule networks is compared with that of LeNet and AlexNet. To show that the capsule network learns unique concepts, we have combined it with other networks and observed a boost in performance.
In the next section, a refresher is provided on the basics of a simple CNN. Section III shows how the capsule network builds on the simple CNN and explains its internal mechanisms. Section IV discusses the experiments and results, and Section V concludes.
II CNN Refresher
Convolutional neural networks are typically designed as a series of 2-d convolution and pooling operations, with non-linear activations in between, followed by a fully connected network for classification. A 2-d convolution is performed by convolving a kernel over an input $I$ of size $c_i \times h_i \times w_i$, where $h_i$, $w_i$ and $c_i$ are the height, width and number of channels of the input $I$. A kernel convolved over such an input should be of shape $c_k \times h_k \times w_k$. Here $c_k$, $h_k$ and $w_k$ are the depth, height and width of the kernel. Note that the depth of the kernel is equal to the number of input channels, that is, $c_k = c_i$. If we use $n_k$ such kernels, then the output tensor is of shape $n_k \times h_o \times w_o$. The output height $h_o$ and width $w_o$ depend on factors like the input height $h_i$, the input width $w_i$, the stride of the kernel and the padding of the input. Convolutions are typically followed by non-linear activations such as a sigmoid, tanh, or rectified linear unit. Pooling operations normally take a small region of the input and compress it to a single value by taking either the maximum (max pooling) or the average (average pooling) of the corresponding activations. This reduces the size of the activation maps. Hence, when kernels convolve over the pooled tensor, each kernel position corresponds to a larger area of the original image. After a series of convolutions, activations and pooling operations, we obtain a tensor representing the extracted features of the image. This tensor is flattened to form a linear vector of shape $1 \times (c_f \cdot h_f \cdot w_f)$, which can be fed as an input to a fully connected network. Here $c_f$, $h_f$ and $w_f$ are the depth, height and width of the tensor to be flattened. The total number of neurons in this layer is $c_f \cdot h_f \cdot w_f$. At the end of the fully connected network we get a vector of size $1 \times n_c$ that corresponds to the output layer, where $n_c$ is the number of classes. A loss such as mean squared error, cross entropy, or negative log likelihood is computed and then back-propagated to update the weights using optimizers like stochastic gradient descent or Adam (adaptive moment estimation).
A schematic diagram is shown in Fig. 1, with a typical convolution operation followed by a fully connected layer to perform classification. Layers such as pooling and non-linearities are not shown, to keep the diagram simple and analogous to Fig. 2.
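The shape bookkeeping described above can be checked with a few lines of Python. This is a minimal sketch: the function names are hypothetical, the floor rounding for strided convolutions is the usual convention rather than something stated in the text, and the 28 x 28 single-channel input is only an illustrative size.

```python
def conv_output_shape(c_i, h_i, w_i, n_k, h_k, w_k, stride=1, padding=0):
    """Return (n_k, h_o, w_o) for n_k kernels of shape c_i x h_k x w_k.

    c_i only fixes the kernel depth; it does not change the output shape.
    Floor rounding is assumed for strided convolutions.
    """
    h_o = (h_i + 2 * padding - h_k) // stride + 1
    w_o = (w_i + 2 * padding - w_k) // stride + 1
    return n_k, h_o, w_o


def flattened_length(c_f, h_f, w_f):
    """Number of neurons obtained by flattening a c_f x h_f x w_f tensor."""
    return c_f * h_f * w_f


# Illustrative values: one input channel, 28 x 28 pixels, 256 kernels of 9 x 9.
shape = conv_output_shape(c_i=1, h_i=28, w_i=28, n_k=256, h_k=9, w_k=9)
print(shape)                     # (256, 20, 20)
print(flattened_length(*shape))  # 102400
```

A stride of 2 on the resulting 20 x 20 maps with the same 9 x 9 kernel gives 6 x 6 maps, which is the spatial size that matters for the capsule layers discussed next.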
III The Capsule Network
The primary concern with CNNs is that the different kernels work independently. If two kernels are trained to activate for two specific parts of an object, they will generate the same amount of activation irrespective of the relative positions of those parts. Capsule networks bring a factor of agreement between kernels into the equation. Subsequent layers receive higher activations when kernels corresponding to different parts of the object agree with the general consensus. The capsule network proposed in [3] consists of two different capsule layers: a primary capsule layer that groups convolutions to work together as a capsule unit, followed by a digit capsule layer that is obtained by calculating the agreement among different capsules through dynamic routing. A schematic diagram of a capsule network is provided in Fig. 2. The diagram does not represent the actual architecture proposed in the original work [3]; rather, it demonstrates a primary and a digit capsule layer. The diagram is kept analogous to the typical CNN shown in Fig. 1 to highlight the major differences.
III-A Primary Capsules
The capsule network starts with a typical convolution layer that converts the input image into a block of activations. This tensor is fed as an input into the primary capsule layer. If the number of channels in this input is $c_i$ and the desired dimension of the primary capsules is $d_p$, then the shape of one kernel is $d_p \times c_i \times h_k \times w_k$, where $h_k$ and $w_k$ are the height and width of the kernel. With $n_k$ such kernels we get an output of shape $n_k \times d_p \times h_o \times w_o$, where the height $h_o$ and width $w_o$ depend on factors like the input height $h_i$, the input width $w_i$, the stride of the kernel and the padding of the input. Unlike normal convolutions, where the activation produced by each kernel has a depth of $1$, the depth of the activations in primary capsules is $d_p$. The total number of primary capsules is $n_p = n_k \cdot h_o \cdot w_o$. Before passing to the next layer, this tensor is reshaped into $n_p \times d_p$.
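The reshaping step can be illustrated with nested Python lists. The sketch below is not the authors' code: `to_capsules` is a hypothetical helper, and the example block simply uses the layer sizes listed for the primary capsule layer (32 kernels, capsule dimension 8) together with an assumed 6 x 6 output grid.

```python
def to_capsules(block):
    """Reshape block[k][d][y][x] (kernels x capsule dim x height x width)
    into a flat list of capsules, each a list of length capsule dim."""
    capsules = []
    for kernel_maps in block:            # one group of d_p maps per kernel
        d_p = len(kernel_maps)
        h_o = len(kernel_maps[0])
        w_o = len(kernel_maps[0][0])
        for y in range(h_o):
            for x in range(w_o):
                # One capsule collects the d_p values at spatial position (y, x).
                capsules.append([kernel_maps[d][y][x] for d in range(d_p)])
    return capsules


# Toy block: 32 kernels, capsule dimension 8, 6 x 6 output grid (assumed).
n_k, d_p, h_o, w_o = 32, 8, 6, 6
block = [[[[float(d)] * w_o for _ in range(h_o)] for d in range(d_p)]
         for _ in range(n_k)]
capsules = to_capsules(block)
print(len(capsules), len(capsules[0]))  # 1152 8
```

The count of 1,152 capsules of dimension 8 agrees with the input side of the Digit Caps layer in the architecture table.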
III-B Digit Capsules
Normally the output layer of a fully connected network is of shape $1 \times n_c$, where $n_c$ is the number of classes. The capsule network replaces the output layer with a digit capsule layer. Each class is represented by a capsule of dimension $d_d$. Hence we get a digit capsule block of shape $n_c \times d_d$. By calculating the L2 norm of each row we get our output layer of shape $1 \times n_c$. The values of the digit capsules are calculated by dynamic routing between primary capsules.
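As a small illustration, the row-wise L2 norm that turns a digit capsule block into an output layer can be written as follows. The function names and the toy 3-class block are hypothetical, and taking the class with the longest capsule as the prediction is the usual reading of this output layer rather than something stated explicitly above.

```python
import math


def class_scores(digit_caps):
    """L2 norm of each row of a (classes x capsule dim) digit capsule block."""
    return [math.sqrt(sum(x * x for x in capsule)) for capsule in digit_caps]


def predict(digit_caps):
    """Index of the class whose digit capsule is longest (assumed decision rule)."""
    scores = class_scores(digit_caps)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy block with 3 classes and capsule dimension 3 (made-up values).
digit_caps = [[0.1, 0.0, 0.2],
              [0.6, 0.0, 0.8],
              [0.0, 0.3, 0.0]]
print(predict(digit_caps))  # 1
```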
III-C Dynamic routing
Dynamic routing [3] is used to obtain the digit capsules from the primary capsules. Two different types of weights are required to perform dynamic routing. Firstly, we need the weights that calculate the individual opinion of every capsule. These weights, $W$, are trained in the usual way using back-propagation. Let $i$ be the index of the primary capsules, of dimension $d_p$, and $j$ the index of the digit capsules, of dimension $d_d$. Then $W_{ij}$ is of shape $d_d \times d_p$. The individual opinion of primary capsule $i$ regarding digit capsule $j$ is given by,
$$\hat{\mathbf{u}}_{j|i} = W_{ij}\,\mathbf{u}_i \tag{1}$$
where $\mathbf{u}_i$ is the $i$-th primary capsule. So for each primary capsule $i$ we get an individual digit capsule block of shape $n_c \times d_d$, where $n_c$ is the number of classes. The second type of weight can be called the routing weights ($b_{ij}$). The routing weights are used to combine these individual digit capsules to form the final digit capsules. They are updated during the forward pass based on how much the individual digit capsules agree with the combined one. The routing weight matrix is of shape $n_p \times n_c$, where $n_p$ is the number of primary capsules. During each forward pass the routing weights are first initialized as zeros. The coupling coefficients $c_{ij}$ are given by,
$$c_{ij} = \frac{\exp(b_{ij})}{\sum_{k} \exp(b_{ik})} \tag{2}$$
The coupling coefficients are used to combine the individual digit capsules and form the combined digit capsule. The combined digit capsule $\mathbf{s}_j$ is given by,
$$\mathbf{s}_j = \sum_{i} c_{ij}\,\hat{\mathbf{u}}_{j|i} \tag{3}$$
A squashing function rescales $\mathbf{s}_j$ such that long vectors get a length close to one and short vectors get a length close to zero. The squashed combined digit capsule $\mathbf{v}_j$ is given by,
$$\mathbf{v}_j = \frac{\lVert \mathbf{s}_j \rVert^2}{1 + \lVert \mathbf{s}_j \rVert^2}\,\frac{\mathbf{s}_j}{\lVert \mathbf{s}_j \rVert} \tag{4}$$
The agreement between the individual digit capsules and the squashed combined digit capsules can be calculated using a simple dot product. The higher the agreement, the more preference is given to the corresponding capsule in the next routing iteration. This is obtained by updating $b_{ij}$ as,
$$b_{ij} \leftarrow b_{ij} + \hat{\mathbf{u}}_{j|i} \cdot \mathbf{v}_j \tag{5}$$
Equations 2 to 5 are repeated for a specific number of routing iterations to perform iterative dynamic routing of the opinions of primary capsules to form the digit capsules.
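The routing loop can be sketched in plain Python as below. This is an illustrative reading of routing-by-agreement as introduced in [3], not the authors' implementation: the function names are hypothetical, the individual opinions are passed in directly (so the trained weights are left out), the default of three iterations is an assumption, and nested lists are used for readability rather than speed.

```python
import math


def length(v):
    """Euclidean length of a vector."""
    return math.sqrt(sum(x * x for x in v))


def squash(s):
    """Shrink s so its length lies in [0, 1) while keeping its direction."""
    norm_sq = sum(x * x for x in s)
    if norm_sq == 0.0:
        return [0.0] * len(s)
    scale = norm_sq / (1.0 + norm_sq) / math.sqrt(norm_sq)
    return [scale * x for x in s]


def softmax(row):
    """Numerically stable softmax over one row of routing weights."""
    m = max(row)
    e = [math.exp(x - m) for x in row]
    total = sum(e)
    return [x / total for x in e]


def dynamic_routing(u_hat, num_iterations=3):
    """u_hat[i][j] is the opinion of primary capsule i about digit capsule j.

    Returns the squashed digit capsules v and the final routing weights b.
    """
    n_p, n_c, d_d = len(u_hat), len(u_hat[0]), len(u_hat[0][0])
    b = [[0.0] * n_c for _ in range(n_p)]        # routing weights start at zero
    v = []
    for _ in range(num_iterations):
        c = [softmax(row) for row in b]          # coupling coefficients
        v = []
        for j in range(n_c):
            # Weighted sum of the opinions about class j, then squash.
            s_j = [sum(c[i][j] * u_hat[i][j][d] for i in range(n_p))
                   for d in range(d_d)]
            v.append(squash(s_j))
        for i in range(n_p):
            for j in range(n_c):
                # Agreement is the dot product of the opinion and the consensus.
                b[i][j] += sum(u_hat[i][j][d] * v[j][d] for d in range(d_d))
    return v, b


# Two primary capsules, two classes: they agree on class 0 and disagree on class 1.
u_hat = [[[1.0, 0.0], [1.0, 0.0]],
         [[1.0, 0.0], [-1.0, 0.0]]]
v, b = dynamic_routing(u_hat)
print([round(length(v_j), 3) for v_j in v])
```

In this toy case the two opinions about class 1 cancel, so its capsule has zero length, while repeated agreement on class 0 shifts the coupling towards it and lengthens that capsule with each iteration.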
III-D Loss Function
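The loss function used for capsule networks is a margin loss for the existence of a digit. The margin loss for digit $k$ is given by,
$$L_k = T_k \max(0,\; m^{+} - \lVert \mathbf{v}_k \rVert)^2 + \lambda\,(1 - T_k)\,\max(0,\; \lVert \mathbf{v}_k \rVert - m^{-})^2$$
Here, $T_k = 1$ iff a digit of class $k$ is present, and $\mathbf{v}_k$ is the squashed digit capsule of class $k$. The upper and lower bounds $m^{+}$ and $m^{-}$ are set to 0.9 and 0.1 respectively. $\lambda$ is set to 0.5.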
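A minimal sketch of this loss for a single sample is given below, using the stated bounds of 0.9 and 0.1 and the down-weighting factor of 0.5. The function name is hypothetical, and summing the per-class terms into one value follows [3]; the text above only defines the per-class term.

```python
def margin_loss(lengths, target, m_plus=0.9, m_minus=0.1, lam=0.5):
    """Margin loss for one sample, summed over classes (summation assumed).

    lengths: digit capsule lengths, one per class.
    target:  index of the class that is present.
    """
    total = 0.0
    for k, cap_len in enumerate(lengths):
        t_k = 1.0 if k == target else 0.0
        # Penalize a short capsule for the class that is present ...
        present = t_k * max(0.0, m_plus - cap_len) ** 2
        # ... and a long capsule for every class that is absent.
        absent = lam * (1.0 - t_k) * max(0.0, cap_len - m_minus) ** 2
        total += present + absent
    return total


# The true class (0) is confidently detected; class 2 is somewhat too long.
print(margin_loss([0.95, 0.05, 0.3], target=0))
```

Only the third class contributes here, with 0.5 times the square of 0.2, so the result is approximately 0.02.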
III-E Regularization
Proper regularization of a network is essential to stop models from over-fitting the data. In the case of capsule networks, a parallel decoder network is connected with the obtained digit capsules as its input. The decoder tries to reconstruct the input image. A reconstruction loss is minimized along with the margin loss so that the network does not over-fit the training set. However, the reconstruction loss is scaled down by a factor of 0.0005 so that it does not dominate the margin loss.
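How the two losses combine can be sketched as follows. The 0.0005 factor is the one stated above; the function names are hypothetical, and measuring the reconstruction error as a sum of squared pixel differences follows [3] and is an assumption here, since the text does not specify it.

```python
def reconstruction_loss(image, reconstruction):
    """Sum of squared differences between the flattened input image and the
    decoder output (assumed form of the reconstruction error)."""
    return sum((x - y) ** 2 for x, y in zip(image, reconstruction))


def total_loss(margin, image, reconstruction, recon_weight=0.0005):
    """Margin loss plus the down-weighted reconstruction loss."""
    return margin + recon_weight * reconstruction_loss(image, reconstruction)


# Toy 4-pixel image; two pixels are reconstructed incorrectly.
print(total_loss(0.02, [0, 1, 1, 0], [0, 0, 1, 1]))
```

With a squared error of 2, the reconstruction term adds only 0.001 to a margin loss of 0.02, which shows how small its influence is kept.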
IV Experimentations and Results
Our experiments focus on the implementation of capsule networks for handwritten Indic digit and character databases. The results have been compared with other well-known CNN architectures, namely LeNet and AlexNet. While LeNet was built for smaller problems like digit classification, AlexNet was intended for much more complicated data such as ImageNet. Each input image is resized to the native input size supported by the respective network (capsule network, LeNet or AlexNet). In total, 7 models have been tested on 5 datasets. Firstly, the basic LeNet, AlexNet and capsule network were tested. The second set of experiments involved ensembles of two of the three networks using probabilistic averaging. Finally, all three networks were combined by averaging the output probability distributions. All the results are tabulated in Table 2.
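The averaging used for the ensembles can be illustrated with a short sketch. The function names and the probability values are made up for illustration; how each network's outputs are turned into a probability distribution is not specified above and is left outside the sketch.

```python
def average_probabilities(*model_probs):
    """Average the class-probability vectors that several models predict
    for the same sample."""
    n_models = len(model_probs)
    n_classes = len(model_probs[0])
    return [sum(p[c] for p in model_probs) / n_models for c in range(n_classes)]


def ensemble_predict(*model_probs):
    """Class with the highest averaged probability."""
    avg = average_probabilities(*model_probs)
    return max(range(len(avg)), key=avg.__getitem__)


# Hypothetical outputs of two networks for one 3-class sample.
alexnet_probs = [0.50, 0.45, 0.05]
capsnet_probs = [0.20, 0.70, 0.10]
print(ensemble_predict(alexnet_probs, capsnet_probs))  # 1
```

Here the first network alone would pick class 0, but the second network's stronger preference for class 1 wins after averaging, which is the mechanism by which one network can correct another.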
IV-A Datasets
We have used five datasets for our experiments. Firstly, we have the Indic handwritten digit databases from CMATERdb (https://code.google.com/archive/p/cmaterdb/) in three scripts, namely Bangla (CMATERdb 3.1.1), Devanagari (CMATERdb 3.2.1) and Telugu (CMATERdb 3.4.1). These are typical 10-class problems, used primarily to challenge the performance of LeNet. Subsequently, the character databases, namely Bangla basic characters (CMATERdb 3.1.2) and Bangla compound characters (CMATERdb 3.1.3.3), give us a 50-class and a 199-class problem to deal with. The descriptions of the datasets are given below. All the datasets were split into train and test sets in the ratio 2:1. The accuracies provided are with respect to the best model in terms of training accuracy.
IV-B Architecture and Hyperparameters
The capsule network has been used as proposed in [3]. Its performance is compared with that of LeNet and AlexNet. The specifics of the capsule network architecture are provided in Table 1. LeNet was primarily built for MNIST digit classification, with only around 61K trainable parameters. AlexNet has around 57 million trainable parameters so that it can tackle harder problems. Like LeNet, the capsule network was also proposed for MNIST digit classification; however, it is much more robust. It has around 8.2 million parameters, of which around 11K are tuned at run time by dynamic routing. All the provided statistics are with respect to a single-channel input of native input size and a 10-class output. All networks are optimized with the Adam optimizer with an initial learning rate of 0.001, an eps of 1e-08 and beta values of 0.9 and 0.999. The experiments were carried out using an Nvidia Quadro P5000 with 2560 CUDA cores and 16 GB of VRAM.
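The parameter counts quoted for the capsule network can be reproduced from the layer sizes in the architecture table. The sketch below is a worked check rather than the authors' code: the helper names are hypothetical, and the bias conventions (one bias per output channel or unit, none in the digit capsule layer) are inferred from the fact that they reproduce the tabulated figures.

```python
def conv_params(c_in, c_out, kernel):
    """Weights plus one bias per output channel of a square convolution."""
    return c_in * kernel * kernel * c_out + c_out


def fc_params(n_in, n_out):
    """Weights plus one bias per output unit of a fully connected layer."""
    return n_in * n_out + n_out


conv = conv_params(1, 256, 9)          # first convolution
primary = conv_params(256, 32 * 8, 9)  # 32 kernels, each producing 8-d capsules
digit = 1152 * 10 * 8 * 16             # one 8 x 16 matrix per capsule pair
routing = 1152 * 10                    # routing weights tuned at run time
decoder = fc_params(160, 512) + fc_params(512, 1024) + fc_params(1024, 784)

total = conv + primary + digit + routing + decoder
print(conv, primary, digit, routing, decoder)
print(total)  # 8227088
```

The total matches the 8,227,088 parameters in the table, of which 11,520 (around 11K) are routing weights.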
IV-C Result and Analysis
The results of the experiments are tabulated in Table 2. It can be clearly seen that the capsule network (written as CapsNet) surpasses LeNet on every dataset used. AlexNet, being almost 7 times larger than the capsule network, performs better than the capsule network. The difference in performance is much more visible for the character datasets, which have a much higher number of classes than the digit datasets. LeNet performs poorly on the character datasets. The capsule network proved to be much more robust against complex data with a higher number of classes. Among the ensembles, combining LeNet with AlexNet is detrimental with respect to AlexNet alone for every dataset. However, adding the capsule network to LeNet or to AlexNet always improved on that network's individual accuracy. This indicates that capsule networks are capable of extracting some information that even AlexNet fails to obtain. For most datasets the best performance was achieved by combining AlexNet with the capsule network, except for Telugu digits, where the combination of all three networks proved to be the best. Furthermore, we have analyzed the rise of test accuracy with every epoch of training (Fig. 3). It can be seen that the capsule network has the steepest slope, signifying that it learns the fastest. Finally, in Table 3 we have compared the obtained results against some state-of-the-art works on the same datasets.
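The summary column of the results table can be recomputed from the per-dataset accuracies. The sketch below uses the LeNet and CapsNet rows; reading the bracketed value as the sample standard deviation is an assumption that happens to reproduce these two rows, and other rows may differ in the last digit because the tabulated accuracies are rounded.

```python
from statistics import mean, stdev

# Accuracies (%) on the five datasets, copied from the results table.
accuracies = {
    "LeNet":   [94.6, 92.1, 95.8, 63.0, 75.4],
    "CapsNet": [97.35, 94.8, 96.2, 90.6, 79.3],
}

for name, acc in accuracies.items():
    # stdev() is the sample standard deviation (divides by n - 1).
    print(f"{name}: {mean(acc):.2f} ({stdev(acc):.2f})")
```

The much smaller spread for the capsule network reflects its steadier accuracy on the two character datasets.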
In terms of computational complexity, the extra overhead lies in the dynamic routing phase. In addition to training by back-propagation, for every sample the routing weights must also be tuned $r$ times, where $r$ is the number of routing iterations. During each iteration, $n_p \times n_c$ coupling coefficients must be calculated, where $n_p$ is the number of primary capsules and $n_c$ the number of classes. Further, a weighted sum over the $n_p$ dimension is needed to compute the combined digit capsules. Finally, $n_p \times n_c$ routing weights must be tuned using the agreement between the individual and combined digit capsules. With all this, capsule networks generally have quite slow iterations, but as evident from Fig. 3 they also learn much faster than LeNet and AlexNet.
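The size of this overhead can be estimated with a short calculation. The function name is hypothetical, the capsule counts are taken from the architecture table (1,152 primary capsules, 10 classes, digit capsule dimension 16), and the choice of three routing iterations is an assumption, since the number of iterations is not stated here.

```python
def routing_cost(n_primary, n_classes, digit_dim, iterations):
    """Per-sample operation counts of the routing phase, by step."""
    per_iteration = {
        "coupling_coefficients": n_primary * n_classes,
        "weighted_sum_multiply_adds": n_primary * n_classes * digit_dim,
        "routing_weight_updates": n_primary * n_classes,
    }
    return {name: iterations * count for name, count in per_iteration.items()}


# Three iterations is an assumed value.
cost = routing_cost(n_primary=1152, n_classes=10, digit_dim=16, iterations=3)
for name, count in cost.items():
    print(name, count)
```

Each routing iteration therefore touches all 11,520 capsule pairs several times for every sample, on top of the usual forward and backward passes.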
V Conclusion
In our current work we have implemented capsule networks on handwritten Indic digit and character databases. We have shown that capsule networks are much superior and more robust compared to the LeNet architecture. We have also seen that capsule networks can act as a booster when combined with other networks like LeNet and AlexNet. The best performance was achieved by combining AlexNet with capsule networks for most of the datasets. Only in the case of the Telugu dataset did the combination of all three networks work best. From the results it can be concluded that, even with 7 times more parameters than the capsule network, AlexNet failed to capture some information that the capsule network learnt. Thus the capsule network was able to improve the performance of AlexNet. Finally, it has also been seen that the capsule network converges much faster than LeNet or AlexNet. In terms of pros and cons, the use of capsule networks can be beneficial for learning with a much smaller number of features and also as an improvement technique for other, bigger networks. The problem with the capsule network is its slow iterative process and its limitation to single-layer routing. That reveals many avenues of research.
Acknowledgment
This work is partially supported by the project order no. SB/S3/EECE/054/2016, dated 25/11/2016, sponsored by SERB (Government of India) and carried out at the Centre for Microprocessor Application for Training Education and Research, CSE Department, Jadavpur University.
References
- [1] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning applied to document recognition,” Proceedings of the IEEE, vol. 86, no. 11, pp. 2278–2324, 1998.
- [2] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” pp. 1097–1105, 2012.
- [3] S. Sabour, N. Frosst, and G. E. Hinton, “Dynamic routing between capsules,” pp. 3856–3866, 2017.
- [4] S. Ukil, S. Ghosh, S. M. Obaidullah, K. Santosh, K. Roy, and N. Das, “Deep learning for word-level handwritten indic script identification,” arXiv preprint arXiv:1801.01627, 2018.
- [5] R. Sarkhel, N. Das, A. Das, M. Kundu, and M. Nasipuri, “A multi-scale deep quad tree based feature extraction method for the recognition of isolated handwritten characters of popular indic scripts,” Pattern Recognition, vol. 71, pp. 78–93, 2017.
- [6] S. Basu, N. Das, R. Sarkar, M. Kundu, M. Nasipuri, and D. K. Basu, “An mlp based approach for recognition of handwritten bangla numerals,” arXiv preprint arXiv:1203.0876, 2012.
- [7] A. Roy, N. Mazumder, N. Das, R. Sarkar, S. Basu, and M. Nasipuri, “A new quad tree based feature set for recognition of handwritten bangla numerals,” pp. 1–6, 2012.
- [8] A. Roy, N. Das, R. Sarkar, S. Basu, M. Kundu, and M. Nasipuri, “An axiomatic fuzzy set theory based feature selection methodology for handwritten numeral recognition,” pp. 133–140, 2014.
