TL;DR
This paper introduces an adaptation of Manifold Mixup for handwritten text recognition using CTC loss, demonstrating improved accuracy across multiple languages and datasets.
Contribution
It adapts Manifold Mixup for CTC-based text recognition, enhancing data augmentation techniques in this domain.
Findings
Improved recognition accuracy on various datasets
Effective adaptation of Mixup for CTC loss
Enhanced performance across multiple languages
Abstract
Modern handwritten text recognition techniques employ deep recurrent neural networks. These techniques are especially effective when a large amount of annotated data is available for parameter estimation. Data augmentation can be used to enhance the performance of the systems when data is scarce. Manifold Mixup is a modern data augmentation method that melds two images, or the feature maps corresponding to these images, and fuses the targets accordingly. We propose to apply Manifold Mixup to text recognition while adapting it to work with a Connectionist Temporal Classification cost. We show that Manifold Mixup improves text recognition results on various languages and datasets.
| Layer | Input depth | Output depth | Filter size | Stride |
|---|---|---|---|---|
| Tiling | 1 | 4 | 2×2 | 2×2 |
| Convolution | 4 | 8 | 3×3 | 1×1 |
| Convolution | 8 | 16 | 4×2 | 4×2 |
| Gated Conv. | 16 | 16 | 3×3 | 1×1 |
| Convolution | 16 | 32 | 3×3 | 1×1 |
| Gated Conv. | 32 | 32 | 3×3 | 1×1 |
| Convolution | 32 | 64 | 4×2 | 4×2 |
| Gated Conv. | 64 | 64 | 3×3 | 1×1 |
| Convolution | 64 | 128 | 3×3 | 1×1 |
| Max-Pooling | 128 | 128 | 4×1 | 1×1 |
| B-LSTM | 128 | 128 | 1×1 | 1×1 |
| Linear | 128 | 128 | 1×1 | 1×1 |
| B-LSTM | 128 | 128 | 1×1 | 1×1 |
| Linear | 128 | 128 | 1×1 | 1×1 |
| Dataset | Training lines | Validation lines |
|---|---|---|
| Maurdor French Handwritten | 26870 | 2054 |
| Maurdor English Handwritten | 10825 | 1115 |
| Maurdor Arabic Handwritten | 11905 | 1125 |
| RIMES | 10532 | 801 |
| IAM | 6482 | 976 |
| CASIA | 35856 | 5914 |
| Mixup position | CER (%) |
|---|---|
| No mixup | 9.39 |
| At the input | 9.55 |
| After the 4th convolutional layer | 10.10 |
| After the 8th convolutional layer | 11.60 |
| Randomly at one of the 3 positions | 8.91 |
| Randomly at one of the 3 positions or no fusion | 9.15 |
| Number of images fused | CER (%) |
|---|---|
| Fusion of 2 images | 8.91 |
| Fusion of 3 images | 9.02 |
| Training variant | CER (%) |
|---|---|
| With gradient multiplication | 8.91 |
| Without gradient multiplication | 10.08 |
| Distribution | CER (%) |
|---|---|
| Uniform [0, 1] | 8.92 |
| Uniform [0.1, 0.9] | 8.95 |
| Distribution of Zhang et al. [16] (used in the other experiments) | 8.91 |
| Distribution of Verma et al. [18] | 9.12 |
| Dataset | Without mixup | With mixup |
|---|---|---|
| Maurdor French Handwritten | 9.39 | 8.91 |
| Maurdor English Handwritten | 16.0 | 14.8 |
| Maurdor Arabic Handwritten | 11.0 | 10.5 |
| CASIA | 27.5 | 23.9 |
| RIMES | 3.30 | 3.32 |
| IAM | 4.68 | 4.64 |
| CER (%) | Min | Max | Mean | Median |
|---|---|---|---|---|
| Without mixup | 9.08 | 9.46 | 9.30 | 9.32 |
| With mixup | 8.65 | 9.11 | 8.88 | 8.91 |
| Number of images | With Mixup | Without Mixup |
|---|---|---|
| 26870 (All) | 8.91 | 9.39 |
| 20000 | 9.85 | 10.4 |
| 10000 | 13.3 | 14.4 |
| 5000 | 20.6 | 23.5 |
| Dropout | Manifold Mixup | CER (%) |
|---|---|---|
| No | No | 15.4 |
| Yes | No | 9.39 |
| No | Yes | 10.6 |
| Yes | Yes | 8.91 |
Manifold Mixup improves text recognition with CTC loss
Bastien Moysset
A2iA SA, Paris, France
Ronaldo Messina
A2iA SA, Paris, France
I Introduction
Text recognition is an important step in most document image analysis applications. It enables automatic access to the information contained in the pages.
Handwritten text recognition systems have improved considerably during the last decade. On the one hand, this improvement is due to the recognition of whole text lines using the Connectionist Temporal Classification (CTC) [1], which implicitly aligns the image and the target sequence. On the other hand, it is enabled by modern recurrent neural network techniques, whether based on interleaved convolutional and 2D Long Short-Term Memory (LSTM) layers [2, 3] or on convolutions followed by 1D-LSTM layers [4, 5].
I-A State of the Art
The use of neural networks has helped to create systems that can cope with high style heterogeneity within a character class. However, these powerful algorithms, with their high number of trained parameters, need a large amount of annotated images to reach optimal performance.
Several methods have been proposed to reduce the need for annotated data when training text recognition systems.
First, in line with what has been proposed in image classification [6], data augmentation can be performed to enlarge the number of training samples. Real images can be slanted and stretched to form new samples [3], or they can be warped with a random grid-based distortion [7].
Secondly, the training set can be extended with artificial images. This can be done by using handwritten-like fonts [8] or by creating text line images from a recomposition of individually extracted real letter images [9, 10]. Artificial text images can also be synthesized from a recurrent model trained to estimate the ink paths [11]. More recently, Alonso et al. [12] proposed a generative adversarial network (GAN) based technique to synthesize handwritten word images.
On the other hand, the effects of the lack of training data should be countered to prevent the network from overfitting. For this, regularization techniques like weight decay or dropout can be applied during the network training [13].
Recently, Mixup strategies have been proposed in the machine learning community, mainly for object classification tasks, with the aim of coping with a reduced amount of available data. The common idea of these strategies is to fuse several (usually two) images or their transformations and to use the resulting interpolated image as an input for training.
Inspired by Chawla et al. [14], who interpolate input features of objects of the same class, DeVries et al. [15] propose to mix samples from the same class after they have been encoded by some of the neural network layers. Input images of different classes can also be fused effectively if the cost function is adapted so that the network learns to estimate an interpolation of the labels [16], and the mixing strategy itself can be learned to avoid manifold intrusions [17].
Verma et al. [18] merge both these ideas and propose to mix the labels and images from different classes, or their encodings, at various layers in the network. They show improved results on unseen data and resistance to adversarial examples.
We base our proposed technique on this method from Verma et al. and discuss it in more detail in Section II.
I-B Problem statement
In this work, we use a manifold mixup based approach, illustrated in Figure 1, to tackle the following issues of training neural networks for handwritten text recognition:
- The heterogeneity of the handwritten text images to recognize, due to varying styles between writers and to different document backgrounds.
- A rather small amount of annotated data, which makes it harder to generalize to unseen images.
- A tendency to overfit, due both to the two items above and to the potentially high number of parameters in the network.
We make several contributions in order to create a manifold mixup system that copes with the specificities of the text recognition task (compared to image classification).
- We introduce a padding and a width-based grouping strategy within mini-batches in order to handle the varying size of the input images.
- We propose a fusion of gradients from two CTC losses in order to mix the two image targets, which are label sequences of unequal lengths.
- We study the impact of the position where the mixup is done, and of other implementation choices, on a standard recurrent text recognition network.
- We show the effectiveness of the proposed method on a set of handwritten databases in several languages and with varying sizes.
We describe the manifold mixup strategy in detail in Section II. The adaptation to CTC recognition training is explained in Section IV. Finally, experimental results are shown in Section V.
II Manifold Mixup
The Manifold Mixup strategy for training classifiers was introduced by Verma et al. [18]. It is related to the input image Mixup strategy of Zhang et al. [16]. The main idea of these methods is to perform, during training, a randomly weighted interpolation of two images (or of transformations of these images) and of their respective labels. Figure 2 illustrates this interpolation, both within the input image space and in a feature map space obtained after forwarding the images through the first layers of the network.
The idea behind Mixup training is the following: in the image-to-label space, we want to model the manifold that maps any image to its label. For this, we use (input, label) pairs that correspond to points in this space. Because the number of trainable parameters is usually higher than the number of available training samples, this setup is prone to overfitting. Using Mixup means that we learn not only from the data points, but also from all the segments linking any two data pairs.
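More formally, consider a neural network made of stacked layers. For a layer $k$ of the neural network, let $g_k$ be the function computed by the layers of the network up to this layer $k$. Similarly, let $f_k$ be the function computed by the layers from this layer $k$ onwards. If $f$ is the function corresponding to the full network, it means that for an input image $x$, we have:

$$f(x) = f_k(g_k(x)) \tag{1}$$

The value $k$ is chosen randomly among several possible values. The Mixup function $\mathrm{Mix}_\lambda$, for a random weighting value $\lambda \in [0, 1]$, is applied to two training input-label couples $(x_a, y_a)$ and $(x_b, y_b)$, so that:

$$\mathrm{Mix}_\lambda(u_a, u_b) = \lambda u_a + (1 - \lambda) u_b \tag{2}$$

where $u$ stands either for the feature maps $g_k(x)$ or for the labels $y$.

The loss $\mathcal{L}$ we optimize is a function of an input $x$ and a label $y$, where the estimated output is $\hat{y} = f(x)$. For a classification problem with $C$ classes, this loss is a cross entropy:

$$\mathcal{L}(\hat{y}, y) = -\sum_{c=1}^{C} y_c \log \hat{y}_c \tag{3}$$

For Mixup, the loss is computed from the output of the network and the weighted interpolation of the two labels.

$$\tilde{y} = \mathrm{Mix}_\lambda(y_a, y_b), \qquad \tilde{p} = f_k\big(\mathrm{Mix}_\lambda(g_k(x_a), g_k(x_b))\big) \tag{4}$$

$$\mathcal{L}_{\mathrm{mix}} = \mathcal{L}(\tilde{p}, \tilde{y}) = -\sum_{c=1}^{C} \big(\lambda y_{a,c} + (1 - \lambda) y_{b,c}\big) \log \tilde{p}_c \tag{5}$$

We can observe that this loss is the weighted sum of the individual losses:

$$\mathcal{L}_{\mathrm{mix}} = \lambda \mathcal{L}(\tilde{p}, y_a) + (1 - \lambda) \mathcal{L}(\tilde{p}, y_b) \tag{6}$$

And that the gradients are such that:

$$\nabla \mathcal{L}_{\mathrm{mix}} = \lambda \nabla \mathcal{L}(\tilde{p}, y_a) + (1 - \lambda) \nabla \mathcal{L}(\tilde{p}, y_b) \tag{7}$$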
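As a minimal sketch of the interpolation described in this section, the following pure-Python example mixes two samples at a randomly chosen layer of a toy network and checks numerically that the cross entropy against the interpolated label equals the weighted sum of the two individual cross entropies. The toy layers, the input values and all function names are assumptions made for illustration; they are not the network used in the paper.

```python
import math
import random


def mix(lam, a, b):
    """Linear interpolation lam * a + (1 - lam) * b, element-wise."""
    return [lam * u + (1.0 - lam) * v for u, v in zip(a, b)]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def cross_entropy(p, y):
    """Cross entropy between probabilities p and a (possibly soft) label y."""
    return -sum(yc * math.log(pc) for pc, yc in zip(p, y))


# Hypothetical toy network: each layer is an element-wise affine map plus tanh.
LAYERS = [(0.5, 0.1), (1.5, -0.2), (0.8, 0.3)]


def apply_layers(h, start, stop):
    for scale, bias in LAYERS[start:stop]:
        h = [math.tanh(scale * v + bias) for v in h]
    return h


def manifold_mixup_forward(xa, xb, lam, k):
    """Mix the two samples after k layers (k = 0 mixes the inputs), then finish the forward pass."""
    ha = apply_layers(xa, 0, k)            # features of the first sample at layer k
    hb = apply_layers(xb, 0, k)            # features of the second sample at layer k
    mixed = mix(lam, ha, hb)
    return softmax(apply_layers(mixed, k, len(LAYERS)))


rng = random.Random(0)
xa, xb = [0.2, -1.0, 0.7], [1.1, 0.4, -0.3]
ya, yb = [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]   # one-hot labels
lam = rng.random()                           # random mixing ratio in [0, 1)
k = rng.choice([0, 1, 2])                    # random mixing position

p = manifold_mixup_forward(xa, xb, lam, k)
loss_soft = cross_entropy(p, mix(lam, ya, yb))
loss_sum = lam * cross_entropy(p, ya) + (1.0 - lam) * cross_entropy(p, yb)
print(abs(loss_soft - loss_sum) < 1e-12)
```

The equality holds because the cross entropy is linear in the label, which is also why the gradients combine with the same weights.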
III Text line recognition
III-A Gated Convolutional Network
For the text line recognition problem, we use a Gated Convolutional Network (GNN) inspired by Bluche et al. [5]. The grayscale input images are isotropically rescaled to a fixed height of 128 pixels, normalized and passed through a 2×2 tiling. The network is composed of 12 layers, as indicated in Table I and in Figure 3. The first 8 layers are convolutional, and some of them are used as gates, which means that their outputs are pointwise multiplied with their inputs. On top of this convolutional encoder, a vertical max-pooling is applied, and a recurrent decoder, comprising two Bidirectional Long Short-Term Memory (B-LSTM) layers and two linear layers, is applied to the resulting one-dimensional signal. The output has the depth of the label set size (including the blank) and a width proportional to the width of the input image.
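To make the "width proportional to the width of the input image" statement concrete, here is a small sketch that follows the spatial size of an image through the convolutional part of Table I. It relies on assumptions that the text does not state: filter sizes and strides are read as (height, width), the 3×3 stride-1 layers are assumed to preserve the size (padded convolutions), and the other layers are assumed to use no padding.

```python
# Spatial layers of Table I as (name, filter, stride); pairs are read as (height, width).
SPATIAL_LAYERS = [
    ("Tiling", (2, 2), (2, 2)),
    ("Convolution", (3, 3), (1, 1)),
    ("Convolution", (4, 2), (4, 2)),
    ("Gated Conv.", (3, 3), (1, 1)),
    ("Convolution", (3, 3), (1, 1)),
    ("Gated Conv.", (3, 3), (1, 1)),
    ("Convolution", (4, 2), (4, 2)),
    ("Gated Conv.", (3, 3), (1, 1)),
    ("Convolution", (3, 3), (1, 1)),
    ("Max-Pooling", (4, 1), (1, 1)),
]


def layer_out(size, filt, stride, same_padding):
    """Output length along one axis for a sliding-window layer."""
    if same_padding:
        return size
    return (size - filt) // stride + 1


def output_size(height, width):
    """Spatial size after the convolutional encoder and the vertical max-pooling."""
    for _name, (fh, fw), (sh, sw) in SPATIAL_LAYERS:
        same = (fh, fw) == (3, 3)   # assumption: 3x3 layers keep the spatial size
        height = layer_out(height, fh, sh, same)
        width = layer_out(width, fw, sw, same)
    return height, width


print(output_size(128, 1000))
```

Under these assumptions, the height of 128 pixels collapses to 1 after the max-pooling, and the width is divided by 8, which is consistent with the one-dimensional signal fed to the recurrent decoder.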
III-B Training with the CTC
The text line recognition problem has some specificities in comparison to classification. The output of the network is not a vector of probabilities, but a sequence of vectors of probabilities $p = (p^1, \dots, p^T)$ whose length $T$ varies and is proportional to the width of the text line image. Similarly, the target is not a label but a sequence of labels $y$ whose length $U$ corresponds to the number of characters in the text line.
Because $T$ and $U$ are different and vary independently, we need to perform some sort of alignment between the predictions and the targets. This alignment is implicitly made by the Connectionist Temporal Classification (CTC) [1].
If we define $\mathcal{B}^{-1}_T$ as the function that transforms a given label sequence $y$ into the set of all the possible sequences of size $T$, including blanks, that reduce to $y$, the CTC loss is defined as:

$$\mathcal{L}_{\mathrm{CTC}}(p, y) = -\log \sum_{\pi \in \mathcal{B}^{-1}_T(y)} \prod_{t=1}^{T} p^t_{\pi_t}$$
This cost can be computed efficiently using dynamic programming alongside the gradients that will be backpropagated in the network.
[TABLE]
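The dynamic programming computation mentioned above can be illustrated with the standard CTC forward recursion of Graves et al. [1]. The sketch below is a plain-Python illustration, not the implementation used in the paper: it works with raw probabilities (a real implementation would use log-space or rescaling for long sequences), it assumes the blank has index 0, and it checks the result against a brute-force sum over all alignments on a tiny example.

```python
import itertools
import math

BLANK = 0  # assumption: the blank symbol has index 0


def collapse(path):
    """CTC reduction: merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for s in path:
        if s != prev and s != BLANK:
            out.append(s)
        prev = s
    return out


def ctc_loss(probs, labels):
    """-log p(labels | probs); probs[t][c] is the probability of class c at frame t."""
    ext = [BLANK]
    for c in labels:
        ext += [c, BLANK]                    # labels interleaved with blanks
    size = len(ext)
    alpha = [0.0] * size
    alpha[0] = probs[0][BLANK]
    if size > 1:
        alpha[1] = probs[0][ext[1]]
    for t in range(1, len(probs)):
        new = [0.0] * size
        for s in range(size):
            total = alpha[s]
            if s >= 1:
                total += alpha[s - 1]
            # Skipping a blank is allowed only between two different labels.
            if s >= 2 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                total += alpha[s - 2]
            new[s] = total * probs[t][ext[s]]
        alpha = new
    likelihood = alpha[-1] + (alpha[-2] if size > 1 else 0.0)
    return -math.log(likelihood)


def ctc_loss_brute_force(probs, labels):
    """Same quantity, summing explicitly over every alignment (tiny inputs only)."""
    n_frames, n_classes = len(probs), len(probs[0])
    likelihood = 0.0
    for path in itertools.product(range(n_classes), repeat=n_frames):
        if collapse(path) == list(labels):
            likelihood += math.prod(probs[t][c] for t, c in enumerate(path))
    return -math.log(likelihood)


probs = [[0.6, 0.3, 0.1], [0.2, 0.5, 0.3], [0.1, 0.2, 0.7], [0.5, 0.25, 0.25]]
labels = [1, 2]
print(abs(ctc_loss(probs, labels) - ctc_loss_brute_force(probs, labels)) < 1e-9)
```

The recursion visits each (frame, extended label position) pair once, whereas the brute-force version enumerates every possible path, which is why the former scales to real text lines.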
IV Applying Mixup to text recognition
In this work, we aim at applying a mixup strategy to the sequence recognition problem. It means that we want, at a given level $k$, to fuse the feature maps $g_k(x_a)$ and $g_k(x_b)$ computed from two text line input images $x_a$ and $x_b$. In practice, we choose $k$ randomly within {0,4,8}, as illustrated in Figure 3.
Because $x_a$ and $x_b$ may have different horizontal sizes, we pad (with white) all the images from a mini-batch to the maximum horizontal size within the mini-batch. In order to save computing time and to get more coherent mixups, we group images of similar width within the same mini-batch, in line with what is done in bucketing techniques [19]. Images are weighted with a random ratio $\lambda$. In practice, $\lambda$ is drawn from a fixed distribution; the choice of this distribution is studied in Section V-B4. After going through the rest of the network, we obtain a common sequence of prediction probabilities $\tilde{p}$ for both lines.
Nevertheless, interpolating the targets as in Equation 4 is not possible, because the two label sequences have different lengths and because the forward-backward algorithm used to compute the CTC works well only for one-hot encodings of the targets.
For this reason, we choose to directly use as a loss the weighted sum of two CTC losses, one for each of the label sequences $y_a$ and $y_b$.

$$\mathcal{L}_{\mathrm{mix}} = \lambda \mathcal{L}_{\mathrm{CTC}}(\tilde{p}, y_a) + (1 - \lambda) \mathcal{L}_{\mathrm{CTC}}(\tilde{p}, y_b)$$

We note that this weighting of losses is analogous to what was observed for classification in Equation 6 and that, as in Equation 7, the resulting gradient is the weighted sum of the gradients from the two targets.

$$\nabla \mathcal{L}_{\mathrm{mix}} = \lambda \nabla \mathcal{L}_{\mathrm{CTC}}(\tilde{p}, y_a) + (1 - \lambda) \nabla \mathcal{L}_{\mathrm{CTC}}(\tilde{p}, y_b)$$
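The data handling described in this section can be sketched as follows, under several assumptions that are not in the text: images are lists of pixel rows, white is encoded as 1.0 after normalization, padding is added on the right, buckets are built by sorting on width, and the mixing is shown at the input level (position 0) only. The helper names and the toy loss are hypothetical; any sequence loss with the signature `loss_fn(probs, labels)`, such as a CTC loss, could be plugged in.

```python
WHITE = 1.0  # assumption: pixel value of white after normalization


def pad_to_width(image, width, fill=WHITE):
    """Right-pad every row of a 2D image (list of rows) to the given width."""
    return [row + [fill] * (width - len(row)) for row in image]


def width_buckets(images, batch_size):
    """Group image indices into mini-batches of similar width (sort, then chunk)."""
    order = sorted(range(len(images)), key=lambda i: len(images[i][0]))
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]


def mix_images(img_a, img_b, lam):
    """Pad both images to the wider one, then interpolate them pixel-wise."""
    width = max(len(img_a[0]), len(img_b[0]))
    a, b = pad_to_width(img_a, width), pad_to_width(img_b, width)
    return [[lam * u + (1.0 - lam) * v for u, v in zip(row_a, row_b)]
            for row_a, row_b in zip(a, b)]


def mixup_sequence_loss(loss_fn, probs, labels_a, labels_b, lam):
    """Weighted sum of two sequence losses computed on the same predictions."""
    return lam * loss_fn(probs, labels_a) + (1.0 - lam) * loss_fn(probs, labels_b)


def black_image(width, height=2):
    return [[0.0] * width for _ in range(height)]


images = [black_image(5), black_image(3), black_image(9), black_image(4)]
buckets = width_buckets(images, batch_size=2)
mixed = mix_images(images[1], images[3], lam=0.25)


def toy_loss(probs, labels):
    # Stand-in for a real sequence loss: simply the label length.
    return float(len(labels))


loss = mixup_sequence_loss(toy_loss, None, "ab", "abcd", lam=0.25)
print(buckets, mixed[0], loss)
```

Grouping by width keeps the amount of padding small, so that most of each mixed image contains ink from both lines and not only white padding.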
V Experiments
V-A Experimental setup
We performed most of the experiments on the handwritten lines in French from the Maurdor dataset [20]. This dataset is particularly challenging, with heterogeneous images from different kinds of documents (forms, letters, drawings, …) and various scanning procedures. We also ran control experiments on other handwritten sub-datasets from Maurdor (English and Arabic), on the easier RIMES [21] and IAM [22] tasks, and on the Chinese CASIA [23] dataset (we do not use the isolated character parts). The number of lines in each dataset is given in Table II.
To assess the performance of the models, we compute a character error rate (CER) as the Levenshtein distance between the predicted and the ground truth sequences. Except for Chinese, where a language model is used to recover the characters from an encoding as detailed in Bluche et al. [24], we do not use any language model and only the concatenated best predictions are considered.
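A minimal sketch of such a character error rate is given below. The text only states that the CER is based on the Levenshtein distance; reporting the total distance as a percentage of the total reference length is the usual convention and is an assumption here, as are the function names.

```python
def levenshtein(a, b):
    """Edit distance (insertions, deletions, substitutions) between two sequences."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def character_error_rate(predictions, references):
    """Total edit distance divided by total reference length, in percent."""
    errors = sum(levenshtein(p, r) for p, r in zip(predictions, references))
    length = sum(len(r) for r in references)
    return 100.0 * errors / length


cer = character_error_rate(["helo wrld"], ["hello world"])
print(round(cer, 2))
```

Accumulating errors and lengths over the whole set, instead of averaging per-line rates, prevents short lines from dominating the score.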
For all the models, we use a Glorot [25] initialization of the weights, RMSProp [26] based gradient descent, mini-batches of size 8, and a learning rate of . Models are selected with an early stopping on the validation set after 200 epochs without improvement.
V-B Ablation study
In this section, we validate and discuss the key design choices we made for our system.
V-B1 Impact of the position of the Mixup
First, we compare, in Table III, the performance of our handwriting recognition system with respect to the position where the mixup of the feature maps is performed. We observe that, contrary to what was observed by Zhang et al. [16] for several tasks, the performance with input mixup is worse than that of the baseline (no mixup) model. In accordance with this same paper [16], our recognition performance decreases if the mixup is performed in higher latent spaces.
However, we observe that performing the mixup randomly at one of several positions in the network, as done in Verma et al. [18], improves the results compared to the no-mixup strategy. This is probably because, by adding some randomness, we force the network to learn with the interpolations and prevent it from disentangling the signals. No further improvement was obtained by training both with and without mixup (9.15% against 8.91% CER).
V-B2 Number of images mixed-up
As shown in Table IV, we do not observe further improvement of the results when mixing 3 images instead of 2.
V-B3 Multiplication of the gradients by the Mixup ratio
A key component of our system is the choice to multiply the gradients (or, equivalently, the loss) by the same random value $\lambda$ that is used to weight the two feature maps during the mixup. To confirm the validity of this choice, we train the network without this multiplication of the gradients by $\lambda$. It means that the network is trained to recognize both label sequences without taking the weighting ratio into account. Results, found in Table V, show that the multiplication by $\lambda$ is indeed very important (8.91% against 10.08% CER).
V-B4 Impact of the mixup ratio distribution
The mixup ratio $\lambda$ is drawn from the distribution used by Zhang et al. [16]. Verma et al. [18] use a different distribution. In Table VI, we compare the results with both these distributions alongside uniform distributions. No significant difference can be observed between the distributions.
V-C Analysis of Mixup results
In order to assess the robustness of the presented manifold mixup approach, we compare the performance on the list of datasets introduced in Section V-A. Results can be found in Table VII. We observe a consistent decrease of the character error rates on the difficult datasets, namely the three datasets from Maurdor and the CASIA dataset. Results on the simpler RIMES and IAM datasets are similar with and without mixup.
The significance of the improvement observed when adding the manifold mixup is addressed in Table VIII by training 10 different networks (meaning that 10 different random seeds are used for initialization), both with and without Mixup, on the Maurdor French Handwritten dataset.
We also compare, in Table IX, the impact of the amount of available training data on the performance. We observe that the relative improvement progressively increases from 5.1% to 12.3% when the dataset size is reduced from about 26 000 to 5 000 examples. This illustrates how using manifold mixup for text recognition increases the generalization ability of the network.
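The relative improvements quoted for Table IX can be recomputed directly from the character error rates of that table. The short sketch below does this; the dictionary layout and the function name are illustrative choices, and the numbers are copied from Table IX.

```python
# CER (%) from Table IX: number of training images -> (with mixup, without mixup).
TABLE_IX = {
    26870: (8.91, 9.39),
    20000: (9.85, 10.4),
    10000: (13.3, 14.4),
    5000: (20.6, 23.5),
}


def relative_improvement(with_mixup, without_mixup):
    """Relative CER reduction brought by mixup, in percent of the baseline CER."""
    return 100.0 * (without_mixup - with_mixup) / without_mixup


improvements = {n: relative_improvement(w, wo) for n, (w, wo) in TABLE_IX.items()}
for n, gain in improvements.items():
    print(f"{n:>6} images: {gain:.1f}% relative CER reduction")
```

The two intermediate training set sizes fall between the two end points, so the gain grows steadily as the training set shrinks.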
Similarly, we observe that using mixup strongly reduces overfitting. In particular, we observe that the CTC loss on the validation set tends to go up at some point without manifold mixup, while it reaches some sort of horizontal asymptote when mixup is performed. This confirms that the mixup approach has a regularization effect on the network, as mentioned in previous works [16, 17, 18].
Following this line of reasoning, we study, in Table X, the interaction with another common layer that has a regularization effect, the Dropout [27], which is used in our training process. We can see that each of them, independently, strongly improves the results of the text recognition. However, their improvements are not completely redundant, as using both of them further improves the performance.
VI Conclusion
In this paper, we presented a new training strategy for text recognition systems based on Manifold Mixup. We showed that this technique acts as a strong regularizer by interpolating the input images and the gradients. We proposed a technique to apply the mixup to images of varying size and to sequence prediction with Connectionist Temporal Classification alignments. We demonstrated a significant improvement of text recognition results on several handwritten datasets of varying sizes and languages.
References
- [1] A. Graves, S. Fernandez, F. Gomez, and J. Schmidhuber, “Connectionist Temporal Classification: Labelling unsegmented sequence data with recurrent neural nets,” in International Conference on Machine Learning, 2006.
- [2] A. Graves and J. Schmidhuber, “Offline handwriting recognition with multidimensional recurrent neural networks,” in Advances in Neural Information Processing Systems, 2009.
- [3] B. Moysset, T. Bluche, M. Knibbe, M. F. Benzeghiba, R. Messina, J. Louradour, and C. Kermorvant, “The A2iA multi-lingual text recognition system at the second Maurdor evaluation,” in International Conference on Frontiers in Handwriting Recognition, 2014.
- [4] J. Puigcerver, “Are multidimensional recurrent layers really necessary for handwritten text recognition?” in International Conference on Document Analysis and Recognition, 2017.
- [5] T. Bluche and R. Messina, “Gated convolutional recurrent neural networks for multilingual handwriting recognition,” in International Conference on Document Analysis and Recognition, 2017.
- [6] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012.
- [7] C. Wigington, S. Stewart, B. Davis, B. Barrett, B. Price, and S. Cohen, “Data augmentation for recognition of handwritten words and lines using a cnn-lstm network,” in International Conference on Document Analysis and Recognition, 2017.
- [8] M. Helmers and H. Bunke, “Generation and use of synthetic training data in cursive handwriting recognition,” in Iberian Conference on Pattern Recognition and Image Analysis, 2003.
