Segmentation-Free Recognition Algorithm Based on Deep Learning for Handwritten Text Image

Segmentation-based offline handwritten character recognition algorithms suffer from the difficulty of segmenting interleaved and touching strokes in handwritten manuscripts. To tackle this problem, a segmentation-free recognition algorithm based on a deep learning network is proposed in this paper. The network consists of four neural layers: an input layer for image preprocessing, a CNNs layer for feature extraction, a BDLSTM layer for sequence prediction, and a connectionist temporal classification (CTC) layer for text sequence alignment and classification. In addition, a novel data processing method is applied for data length equalization. On this basis, groups of experiments on six typical databases are carried out, covering the evaluation indicators of character correct rate, training time cost, storage space cost and testing time cost. The experimental results show that the proposed algorithm achieves better accuracy and efficiency than other classical algorithms.


Introduction
The conventional offline optical character recognition (OCR) algorithm has long been a hotspot in the field of pattern recognition. It consists of four steps: character image preprocessing, text line segmentation, character segmentation and character recognition. With the development of machine learning, some brilliant classification algorithms [1][2][3][4][5][6] are good enough for single character recognition. Thus, the key step for good recognition results in segmentation-based algorithms is text segmentation. Several approaches have been proposed to settle the character segmentation problem.
The connected-component algorithm [7] performs well on non-touching text but cannot reliably split touching character images. The Viterbi-based algorithm [8] and the dripping segmentation algorithm [9] did well in this task, but both were prone to unsolvable over-segmentation problems. Other attempts addressed the problem with language model-based recognition algorithms [10][11][12][13], which tried to improve recognition performance by integrating predefined language models into the recognition algorithms. However, the predefined language models usually considered only the short-distance character history and ignored long-term context.
Touching strokes and over-segmentation still limit the performance of the existing segmentation-based OCR algorithms. Given that, the idea of "segmentation-free" recognition, called sequence labeling as shown in Fig. 1, has received significant attention. In sequence labeling, a text line image is directly converted to a label sequence. T. H. Su [14] proposed a sequence labeling algorithm based on a hidden Markov model (HMM). It used a sliding window to extract the image features of the handwritten text line and predicted the optimal recognition result with the HMM. Afterwards, with the development of deep learning, some sequence labeling architectures [15,16] based on deep learning were used for sequence prediction. However, these algorithms required pre-extracted features as network inputs, which became a new bottleneck of the sequence labeling approach.
Thereby, some end-to-end algorithms [17][18][19][20][21] were proposed to convert text line image inputs into label sequence outputs without feature pre-extraction. The convolutional recurrent neural network (CRNN) [19] succeeded in English text sequence labeling. However, it was prone to over-fitting in situations with a large alphabet or small databases.
Thus, a sequence labeling algorithm is proposed in this paper. The rest of the paper is organized as follows. Section 2 presents the proposed algorithm. Section 3 introduces the experimental environment and databases used in this paper. Four networks based on the proposed algorithm with different parameters and the experimental results are discussed in detail in Section 4. Section 5 gives the conclusion.

The Proposed Algorithm
The architecture of the proposed network consists of four neural network layers, including preprocessing, CNNs, BDLSTM and CTC loss layers, as shown in Fig. 2. The preprocessing layer converts the image into a normalized binary image, which is further discussed in the experimental databases section. The designed CNNs are used for spontaneous feature extraction and image dimensionality reduction in the second layer of the proposed algorithm. The BDLSTM layer follows the CNNs to predict the sequence output based on contextual information. The CTC loss layer is used for the alignment between sequences and labels, and for backpropagation of the loss at the end of the network.
Fig. 2. Architecture of the proposed network.

CNNs Layer
The CNNs layer, used for feature extraction and image dimensionality reduction, includes several convolutional and pooling layers, but excludes the fully connected layers used in common CNNs, in order to retain the structure information of the input images as completely as possible. Thus, the architecture of the CNNs layer should be designed to meet the following two requirements.
(2) The width of the CNNs layer outputs: W ≥ 5 × label length. On that foundation, the outputs of the CNNs layer can fit the input dimension requirements of the BDLSTM layer and retain the structure information at the same time. Supposing that the height of the preprocessed image is H = 64, the architecture of the CNNs and pooling layers in this paper is designed as shown in Fig. 3.
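Requirement (2) can be checked numerically. The sketch below is illustrative only (the helper names and the example layer stack are not from the paper): it propagates an input width through a stack of convolution/pooling layers using the standard output-size formula and tests the W ≥ 5 × label-length condition.

```python
def out_size(size, kernel, stride, pad):
    # Standard output-extent formula for a conv/pool layer:
    # floor((size + 2*pad - kernel) / stride) + 1
    return (size + 2 * pad - kernel) // stride + 1

def cnn_output_width(input_w, layers):
    """Propagate the width through a stack of (kernel, stride, pad)
    layers applied along the horizontal axis."""
    w = input_w
    for k, s, p in layers:
        w = out_size(w, k, s, p)
    return w

def meets_requirement(width, label_len):
    # Requirement (2): W >= 5 x label length.
    return width >= 5 * label_len
```

For instance, a 3×3 convolution with stride 1 and padding 1 preserves the width, while a pooling layer with horizontal kernel 2 and stride 2 halves it; the designer can iterate over candidate stacks until the requirement holds for the longest label in the database.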

BDLSTM Layer
The designed CNNs layer produces a feature vector with structure semantics, which is invaluable for label prediction. Several networks are good at label prediction. RNNs [6] do well in sequence prediction by using information about long-past inputs. Moreover, bidirectional recurrent neural networks (BDRNNs) [22] can make good use of both past and future inputs. Based on BDRNNs, it is possible to predict the labels from a suitable feature vector produced by the CNNs layers mentioned above.
However, BDRNNs are difficult to train, and they cannot retain information over long periods because gradient information either decays or grows excessively along the time delay, known as the vanishing gradient or exploding gradient problem [23,24], as shown in Fig. 4(a) for the BDRNN network. Meanwhile, the LSTM [25] is a variant of RNNs that can tackle this problem effectively.
The main idea of LSTM is to replace the basic unit in RNNs with a memory unit containing one or more self-connected memory cells and three gates. This makes it feasible to use long-term information about input sequences to predict labels. The gradient transfer in RNNs and LSTM is shown in Fig. 4(b), corresponding to Fig. 4(a). Moreover, the BDLSTM network, as shown in Fig. 5, is good at using long-term information from both past and future input sequences. However, suppose that a network consisting of the CNNs and BDLSTM layers described above is fed the part of the input image shown in Fig. 1; it may produce results such as "rroouunndd" or "roundd" for the ground truth "round", because the output of such a network cannot align with the ground truth correctly. Therefore, the inputs of the CTC algorithm are compressed with designed rules, in which the blank character "−" plays a key role. It yields correct results in the situations shown in formulas (2) and (3).
(a) If there is a "−", the result is shown in formula (2).
(b) If there is no "−", the result should be what is shown in formula (3).
In this way, the outputs of the CTC layer can align with the label sequences.
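Rules (a) and (b) amount to the standard CTC collapsing function: merge consecutive duplicates, then delete blanks. A minimal sketch (the function name is illustrative):

```python
def ctc_collapse(path, blank="-"):
    """Collapse a frame-level CTC path into a label string:
    merge consecutive repeats, then drop the blank symbol."""
    out, prev = [], None
    for ch in path:
        if ch != prev and ch != blank:  # skip repeats and blanks
            out.append(ch)
        prev = ch
    return "".join(out)
```

For the example above, both "rroouunndd" and "roundd" collapse to "round", while a blank between genuinely repeated letters, as in "to-o", preserves the doubled letter and yields "too".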

The Forward and Backward Process
Suppose that there is an input image I with a size of 64×600, the label sequence corresponding to I is l, and the letter set is L′ = L ∪ {−}. The forward and backward passes of the proposed network can be expressed as follows.
where θC represents the parameters of the CNNs layers and F represents the feature vector output of the CNNs layer. The CNNs layer shown in Fig. 2 corresponds to the third structure described in Fig. 3. Through the CNNs layer, the image I is transformed into a feature vector F with a size of D5×1×75, as shown in Fig. 2.
(c) BDLSTM layers. The BDLSTM layer is represented by B(·). The main task of the BDLSTM layers is to predict the label sequence based on the feature vector F. The forward pass of the BDLSTM layers can be expressed as formulas (5) and (6).
y = Softmax(h) (6)
where θB represents the parameters in the BDLSTM layers and h, with a size of 75 × |L′|, is the output of the BDLSTM network. This output is sent to a Softmax layer to obtain the posterior probability vector y of the label sequence prediction, with a size of 75 × |L′|, as given in formula (6).
(d) CTC loss layer. The optimal label sequence prediction l* for the input image can be estimated from the posterior probability vector y by the CTC loss layer (represented by CTC(·)) using formula (7).
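As a sketch of formulas (6) and (7) under common CTC practice (the paper's exact decoder is not reproduced here, and the function names are illustrative): the BDLSTM outputs h (T × |L′|) are turned into per-frame posteriors by a row-wise softmax, and a greedy best-path decode takes the argmax label per frame and collapses the resulting path.

```python
import numpy as np

def softmax_rows(h):
    """Row-wise softmax: frame-level posteriors y from BDLSTM outputs h."""
    e = np.exp(h - h.max(axis=1, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=1, keepdims=True)

def best_path_decode(y, letters):
    """Greedy CTC decode: argmax per frame, then collapse repeats and blanks.
    `letters` is the alphabet with the blank '-' included."""
    path = [letters[i] for i in y.argmax(axis=1)]
    out, prev = [], None
    for ch in path:
        if ch != prev and ch != "-":
            out.append(ch)
        prev = ch
    return "".join(out)
```

Best-path decoding is only an approximation to the most probable labeling; beam-search CTC decoders sum over all paths collapsing to the same label, but the greedy variant is the cheapest and often adequate in practice.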
2.4.2. Backward
The backpropagation through time (BPTT) algorithm by Werbos [27] is used to train the proposed networks in this paper. In addition, to train the networks in an end-to-end manner and tackle the alignment problem of sequence prediction, the CTC loss is placed behind the Softmax layer. Suppose there is a ground-truth dataset composed of a group of handwritten images I correctly labeled with label sequences l. The dataset D is given in formula (8).
The CTC loss, denoted loss, can be calculated using formula (9).
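Formula (9) is not legible in this copy of the text; in standard CTC (Graves [26]), the loss referred to here is the negative log-likelihood of the correct labelings over the dataset D:

```latex
\mathrm{loss} = -\sum_{(I,\, l)\, \in\, D} \ln p(l \mid I),
\qquad
p(l \mid I) = \sum_{\pi \,:\, \mathcal{B}(\pi) = l} \;\prod_{t=1}^{T} y^{t}_{\pi_t}
```

where B(·) is the collapsing function described earlier, π ranges over frame-level paths, and y^t_{π_t} is the posterior probability of label π_t at frame t.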
Then, the backpropagation of the loss to the BDLSTM layers can be calculated with the chain rule based on the BPTT algorithm, as shown in formula (10). Each partial derivative in formula (10) can be calculated based on the derivation by Graves [26].
Similarly, the backpropagation of the loss to the CNNs layers can also be calculated with the chain rule, as shown in formula (11), where θH represents the parameters of the hidden layers of the BDLSTM networks.
Once one iteration of network training on a single input image is done, the network will, in the best case, converge after some epochs of such iterations.

Experimental Environment
The experimental results in this paper are all based on a desktop PC with an Intel(R) Core(TM) i9-10850K CPU, 16G memory, and an NVIDIA GeForce GTX 1080 with 8G memory.

Experimental Databases
To evaluate the effectiveness of the proposed algorithm, six typical offline handwritten databases were used for the experiments: the IAM handwriting database [28] for English, the CASIA-HWDB database (unconstrained text lines, HWDB2.0-2.2) [29] and the HIT-MW database [30] for Chinese, the DigitalPeter database [31] for Russian, the Rodrigo database [32] for Spanish, and the KHATT database [33] for Arabic. All of the databases are correctly labeled. Fig. 6 shows some samples from the databases, while Table 1 lists the properties of the six databases. For better training and testing, the images and labels used in this paper were normalized as follows.
(a) Normalization of the image height. The height of each input image was resized to H = 64 pixels for better efficiency while retaining as much information as possible.
(b) Normalization of the image width and label length. While the image height has been normalized, the image width and label length differ between individual samples. Hence, another contribution of this paper is a data processing method that normalizes the database to a uniform size.
Taking the IAM database as an example, the label length of the samples ranges from 1 to 94 characters. A naive approach normalizes every label to the maximum label length of 94 characters by padding with "−". For instance, the label "A MOVE to stop Mr. Gaitskell from", with a length of 34, would be padded with 60 "−" characters. In a more extreme case, a label with a length of 1 would need 93 "−" characters padded. This pulls in too much useless information, which is disastrous for training. Therefore, the main idea of the data processing method in this paper is to fill the samples with existing samples instead of "−".
Step 1. Collect padding samples. Collect some samples with a label length ranging from 1 to 51 characters, called padding samples.
Step 2. Get the original data size. Suppose the size of the original image I is (h, w) and the length of the original label l is |l|.
Step 3. Find a suitable padding sample for the original data. Find a padding sample whose image Ip has a size of (h, wp) and whose label lp has a length |lp| such that |l| + |lp| = 52.
Step 4. Pad the original data. Produce a new image with a size of (h, w + wp) by splicing the original image I with the chosen padding sample Ip in the horizontal direction; the new label is produced simply by splicing the label strings.
Step 5. Normalize the image. The spliced image is resized to a uniform size.
In this way, the sample with a label length of 34 mentioned above can be normalized more reasonably; its new label becomes "A MOVE to stop Mr. Gaitskell from Commonwealth Relations Secretary, that they wish to be kept". The rest of the databases used in this paper were normalized in a similar manner. Moreover, each database was divided into three parts of 60%, 20% and 20% as the training, validation and testing sets, respectively.
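The five steps above can be sketched as follows. This is a toy illustration: the names are hypothetical, and images are stood in for by their widths, with Step 5's resize omitted.

```python
TARGET_LEN = 52  # required total label length after Step 3

def pad_sample(label, width, padding_pool):
    """Steps 3-4: pick a padding sample whose label length complements the
    original label to TARGET_LEN, then splice the labels and image widths.
    `padding_pool` is a list of (pad_label, pad_width) pairs (Step 1)."""
    need = TARGET_LEN - len(label)  # Step 2: how many characters are missing
    for pad_label, pad_width in padding_pool:
        if len(pad_label) == need:  # Step 3: |l| + |lp| = 52
            # Step 4: horizontal splice of labels and image widths
            return label + pad_label, width + pad_width
    raise ValueError("no padding sample with label length %d" % need)
```

Because the filler comes from real handwriting rather than blank symbols, the equalized batch carries no artificial "−" runs, which is the stated advantage over naive blank padding.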

Experimental Results and Discussions
This section explains and discusses the details of the proposed network implementation. The parameters and output sizes of each layer of the proposed network are described in Table 2, where the batch size for network training is 16. First, the preprocessing layer transforms the original image into a binary image with a size of 64 × 2133 and normalizes the original labels to the maximum label length. The size of the outputs is 16 × 1 × 64 × 2133 with the batch size of 16.
Then the batches from the layer above are processed by the CNNs layers into a feature vector F with a size of 16 × 512 × 1 × 513. Conv and Pool in Table 2 represent convolutional and max pooling layers, respectively. The experimental results of training loss and testing character correct rates (CCRs) on the six databases are shown in Fig. 7 and Table 3. After 20 epochs of training, the proposed algorithm obtains good performance on all databases except the KHATT database, which should be caused by the unique typography of Arabic. Moreover, the differences in convergence speed in Fig. 8 are produced by the different sizes of the databases.
Furthermore, three additional variant networks, named Proposal 2, Proposal 3 and Proposal 4, are presented in Table 4 based on the network in Table 2, hereinafter called Proposal 1. Proposal 3 uses more suitable pooling kernels in the CNNs layers of Proposal 1, retaining the structure information of the input images as much as possible while shortening the time for each iteration. Proposal 2 and Proposal 4 use a more complex BDLSTM with more memory units, based on Proposal 1 and Proposal 3 respectively, to improve the prediction ability of the network. Fig. 8 compares the experimental results of the CRNN and Proposals 1, 2, 3 and 4 within 20 epochs. It can be seen that the proposed algorithm outperforms the CRNN in terms of convergence time and CCR. Moreover, Proposal 2 and Proposal 4 converge earlier and achieve better CCRs than the other two proposals due to their more complex BDLSTM layers. On the side of efficiency, the 36DGVS algorithm has the worst performance, with a testing time cost of 1.81612 s per character and a model size of 30.96M. From Fig. 9, Proposal 2 and Proposal 4 occupy 161M and 46.7M more space than Proposal 1 and Proposal 3, respectively, due to their more complex BDLSTM layers. However, this makes no significant difference in testing time: Proposals 1, 2, 3 and 4 take 10.2 ms, 13.1 ms, 10.8 ms and 11.6 ms to recognize one letter on the CPU, respectively. Thus, apart from their longer training time, Proposal 2 and Proposal 4 perform better than the others.
Overall, Proposal 2 and Proposal 4 prove useful in the task of sequence labeling with good accuracy and efficiency.

Conclusion
A segmentation-free handwritten text recognition algorithm based on deep networks is proposed in this paper to tackle the challenges of sequence labeling for handwritten manuscripts with interleaving and touching strokes. The proposed framework consists of preprocessing, CNNs, BDLSTM and CTC loss layers. Six representative handwritten databases covering different languages, including English, Chinese, Russian, Spanish and Arabic, are used for comparison experiments against a group of typical segmentation-based and segmentation-free algorithms. The proposed networks proved useful in the task of sequence labeling, while Arabic, whose letters have several variant shapes, will need an improved algorithm for better recognition results in future work.
The main contributions of this paper are as follows. (a) An improved network based on the CRNN, consisting of four neural network layers, is proposed to deliver better performance and higher efficiency in handwritten sequence recognition. (b) A novel database normalization method, based on data filling, is proposed to normalize the databases to a uniform size for better model training. (c) Handwritten recognition experiments in multiple languages are carried out to test the robustness of the network.

2.4.1. Forward
(a) Image preprocessing. The input image I should be converted to a binary image and its height resized to H = 64 for the stability and efficiency of the network. In the following, suppose there is one grayscale image I with a size of 64×600 as the input of the CNNs.
(b) CNNs layers. The CNNs layer is represented by the symbol C(·). The forward pass of the CNNs layers is expressed by formula (4).

(1, 3×3, 1, 1) represents a convolutional layer with a 3×3 convolutional kernel, a stride of 1 and zero padding of 1. (2×2, (2,1), (0,1)) represents a max pooling layer with a 2×2 pooling kernel, a vertical stride of 2, a horizontal stride of 1, vertical zero padding of 0 and horizontal zero padding of 1. Next, the prediction vector h, sized 535 × 16 × 1024, is produced by the BDLSTM layers from the feature vector F, and the posterior probability vector y, sized 8560 × |L′|, where |L′| represents the size of the Chinese letter set in the databases, is obtained through a Softmax layer based on h. (512, 512, 512) represents a BDLSTM network with 512, 1024 and 1024 units in the input, hidden and output layers, respectively, doubled because of its bidirectionality. Lastly, the CTC layer produces a label prediction sized 832 × 1, where 832 is the product of the label length |l| and the batch size of 16.

Fig. 7. Analysis of experimental results on the six databases.

Table 3. Character correct rates (CCRs) on the six databases.
Fig. 8. Comparative analysis of experimental results for the six databases: (a) comparison of training loss, (b) comparison of testing accuracy.

Fig. 9. Histograms of average testing CCR and average testing time cost.

Table 1. Properties of the six offline handwritten databases.

Table 2. Parameters of the proposed networks.
This article has been accepted for publication in a future issue of this journal, but it is not yet the definitive version. Content may undergo additional copyediting, typesetting and review before the final publication. Citation information: Ge Peng, Segmentation-Free Recognition Algorithm Based on Deep Learning for Handwritten Text Image, Journal of Artificial Intelligence and Technology (2024), DOI: https://doi.org/10.37965/jait.2024.0473

Table 5. Comparisons with the state of the art.

As shown in Table 5, the average CCR on the KHATT database is 43.18%, which is caused by the unique typography of Arabic, whose letters have several variant shapes. Nevertheless, the proposed algorithm achieves better results than the others, with 88.67% CCR and 11.43 ms per-character recognition time on average. Specifically, Proposals 1, 2, 3 and 4 achieve average CCRs of 86.23%, 91.13%, 86.87% and 90.44%, respectively. The best CCRs on the six databases are 96.69%, 98.92%, 98.60%, 97.22%, 99.42% and 58.02%, obtained by Proposals 2, 2, 4, 2, 4 and 1, respectively. Proposal 2 and Proposal 4 perform better than the others in sequence labeling accuracy.