READ-BAD: A New Dataset and Evaluation Scheme for Baseline Detection in Archival Documents
Tobias Grüning (1), Roger Labahn (1), Markus Diem (2), Florian Kleber (2), Stefan Fiel (2) ((1) University of Rostock - CITlab, (2) TU Wien - Computer Vision Lab)

TL;DR
This paper introduces READ-BAD, a diverse dataset of archival documents, and a new baseline-based evaluation scheme for text line detection that handles skewed and rotated text without binarization.
Contribution
It provides a novel dataset with varied layouts and degradations, and proposes an evaluation method that simplifies assessment of text line detection algorithms.
Findings
New dataset with 2036 archival images and diverse layouts.
Evaluation scheme that does not require binarization and handles skewed/rotated text.
Results demonstrating the effectiveness of the proposed evaluation scheme.
[TABLE II: R and P values for Ex. 1 to Ex. 5]
[TABLE III: GT lines, HY lines, R and P for the tracks Simple and Complex]
Tobias Grüning, Roger Labahn
Computational Intelligence Technology Lab
University of Rostock
18057 Rostock, Germany
{tobias.gruening, roger.labahn}@uni-rostock.de
Markus Diem, Florian Kleber and Stefan Fiel
Computer Vision Lab
TU Wien
1040 Vienna, Austria
{diem,kleber,fiel}@cvl.tuwien.ac.at
Abstract
Text line detection is crucial for any application associated with Automatic Text Recognition or Keyword Spotting. Modern algorithms perform well on well-established datasets since these comprise either clean data or simple/homogeneous page layouts. We have collected and annotated 2036 archival document images from different locations and time periods. The dataset contains varying page layouts and degradations that challenge text line segmentation methods. Well-established text line segmentation evaluation schemes such as the Detection Rate or Recognition Accuracy require binarized data that is annotated on a pixel level. Producing ground truth by these means is laborious and not needed to determine a method’s quality. In this paper we propose a new evaluation scheme that is based on baselines. The proposed scheme has no need for binarization and it can handle skewed as well as rotated text lines. The ICDAR 2017 Competition on Baseline Detection and the ICDAR 2017 Competition on Layout Analysis for Challenging Medieval Manuscripts used this evaluation scheme. Finally, we present results achieved by a recently published text line detection algorithm.
I Introduction
Layout analysis (LA) is considered an open research topic, especially for historical collections, and is a major pre-processing step for e.g. Keyword Spotting (KWS) or Handwritten Text Recognition (HTR). In recent years several competitions were organized to evaluate the performance of layout analysis algorithms: some focus purely on LA [1, 2, 3, 4, 5, 6], others require a good LA as a pre-processing step to achieve competitive results [7, 8, 9]. The ongoing effort in organizing such competitions strongly indicates that there is still a need for improvement concerning LA.
Even state-of-the-art algorithms have problems if they are faced with degradations related to historical documents [6], e.g. faded-out ink, bleed-through, marginalia, skewed and touching/overlapping text lines. In contrast, reported results of LA algorithms are surprisingly good, with accuracies far better than [math] [10, 11, 12, 13, 14, 15]. This is mainly because the well-established, easily accessible datasets (like the IAM-HistDB consisting of the Saint Gall Database [16], Parzival Database [17] and Washington Database [17], the datasets provided via the competitions [1, 3, 5], the datasets introduced in [13] and even newly proposed datasets like the collection of Southeast Asian palm leaf manuscript images [18]) do not cover the full range of difficulties present in historical documents. The datasets contain either modern, well-aligned handwritten texts that pose no serious difficulties for state-of-the-art algorithms, or very homogeneous layouts within a dataset; hence it is easy to adapt algorithms to such datasets.
Since state-of-the-art methods achieve high accuracies on well-established datasets, there is a need for a new, challenging dataset with complex page layouts and a greater variety in terms of script, time range and place of origin. A wide variety of degradations as well as different resolutions and orientations should be present. Since the landscape of document analysis has changed in recent years, and machine-learning-based algorithms are becoming more and more popular not only for KWS [19] and HTR [20] but also for LA [21, 22, 23], the dataset should consist of hundreds of pages to provide an appropriate number of training samples.
Besides the characteristics of the images, the kind of ground truth (GT) provided is essential. The variety of GT given for different datasets ranges from origin points [6] over polygons surrounding the text lines [16, 17] and ground truth on pixel level [1, 3, 13] to detailed information about text region entities [4] and reading order [8]. Since in most application scenarios LA is mainly a pre-processing step for HTR, it is meaningful to provide goal-oriented GT. Modern HTR systems require text lines as input [20, 19], which is why we restrict ourselves to the text line detection scenario and ignore issues like entity classification and reading order. Nevertheless, in complex layout scenarios (e.g. tables, multi-column texts, marginalia) it is mandatory to detect the page layout to achieve correct text line segmentation results. Ignoring the page layout typically leads to an undersegmentation of text lines, see Sec. IV. Therefore, the text line segmentation scenario implicitly comprises the page segmentation scenario as a required intermediate processing step.
Characterizing the text lines solely by origin points is in our opinion not sufficient, since origin points do not capture characteristics of the text lines such as skew, orientation and dimension. On the other hand, [24] showed that the HTR accuracy is not significantly affected by the polygon surrounding the text lines. Even simple strategies to construct surrounding polygons given baseline representations lead to satisfying results [24]. Therefore, GT based on baseline representations for the text lines is in our opinion a reasonable compromise. Furthermore, annotating baselines is less cumbersome than annotating surrounding polygons and therefore cheaper.
Since the widely-used evaluation schemes rely on surrounding polygons and use area (or foreground pixel) based methods to calculate the accuracy of text line segmentation results, there is a need for an evaluation scheme suitable for baselines.
In this paper, we introduce a new dataset containing pages of historical documents with annotated baselines. Furthermore, we propose a newly developed, goal-oriented evaluation scheme working with baseline representations of the text lines. This scheme was already used in two layout analysis competitions, namely the ICDAR 2017 Competition on Baseline Detection (cBAD) and the ICDAR 2017 Competition on Layout Analysis for Challenging Medieval Manuscripts. While we published a report of the cBAD competition along with the evaluation scheme in [25], this paper aims at a thorough introduction of the evaluation scheme. In addition, the collection of the dataset and its sources are described.
The remainder of the paper is structured as follows. In Section II the dataset is described, a meaningful subdivision is explained and some example pages as well as statistics are shown. Section III describes the newly proposed evaluation scheme along with some examples demonstrating the functionality of the scheme. In Section IV the results obtained by a recently published text line detection method are presented. Section V concludes the paper.
II Dataset
The ICDAR Competition on Baseline Detection (cBAD) dataset [26] is composed of document page images that were collected from different archives.
II-A Baseline Definition
A baseline is defined in the typographical sense as the virtual line on which most characters rest and below which descenders extend. Each text line is annotated by a single baseline. Non-textual symbols are not annotated; these include decoration lines, dotted lines, images, noise/stains, initials and bleed-through text. A baseline is split if
- it spans different columns.
- it spans different document pages.
- it connects marginalia and the body text.
If a text line is clearly not part of a table (column) system, a single baseline is annotated even if it crosses column borders.
II-B The cBAD Dataset
About [math] document images from each of nine different European archives were collected. These documents were written between [math] and [math]. We sampled images from each archival collection using a freely available Python script (https://github.com/TUWien/Benchmarking). This results in a set of [math] document images. A more detailed description of the different document collections is given below.
Archive Bistum Passau (ABP): collection contains images photographed at dpi. The documents include parish registers of baptisms, marriages, and funerals.
Bohisto - Bozen State Archive: page images of council minutes written between and .
Venice Time Machine (EPFL): about pages from indexes of records, records of real property transactions, and daily death registrations written between the th and th century.
Humboldt University Berlin (HUB): student notes of lectures given by Alexander von Humboldt between and .
National Archive Finland (NAF): page images from account books, a court book, a census book, and a church book that cover a time period from until the s.
Marburg State Archive: page images from the Grimm collection comprising letters, postcards, and greeting cards.
University College London (UCL): the Bentham papers include . Most pages were written by the British philosopher Jeremy Bentham between and .
Brabant Archive (BHIC): composed of various types of tables containing census information.
University Bibliography Basel (unibas): e-manuscripta (http://www.e-manuscripta.ch/).
II-C Data Annotation
After removing images due to quality as well as content issues the number was reduced from [math] to [math]. For these images the text regions as well as the baselines were annotated by DigiTexx. The well-known PAGE XML scheme (http://www.primaresearch.org/tools) is used for storing text region and baseline information. A final review process by two independent operators reduced the total number to 2036 images. In total, [math] annotated baselines are available.
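Baselines stored this way can be read back with a few lines of code. The following is a minimal sketch, assuming the common PAGE layout in which each `TextLine` element has a `Baseline` child whose `points` attribute lists `x,y` pairs separated by spaces. The function names and the sample document are illustrative and not part of the dataset tooling.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only; real files are produced by the annotation tools.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page imageFilename="example.jpg" imageWidth="100" imageHeight="100">
    <TextRegion id="r1">
      <Coords points="0,0 100,0 100,100 0,100"/>
      <TextLine id="l1">
        <Coords points="5,10 95,10 95,25 5,25"/>
        <Baseline points="10,20 50,22 90,21"/>
      </TextLine>
    </TextRegion>
  </Page>
</PcGts>"""


def parse_points(points_attr):
    """Turn a string such as '10,20 30,40' into [(10, 20), (30, 40)]."""
    return [tuple(int(float(v)) for v in pair.split(","))
            for pair in points_attr.split()]


def read_baselines(page_xml_text):
    """Return one list of (x, y) vertices per Baseline element."""
    root = ET.fromstring(page_xml_text)
    baselines = []
    for element in root.iter():
        # Compare the local tag name so that any PAGE namespace version matches.
        if element.tag.rsplit("}", 1)[-1] == "Baseline":
            baselines.append(parse_points(element.get("points", "")))
    return baselines


baselines = read_baselines(SAMPLE)
print(baselines)
```

Matching on the local tag name instead of a fixed namespace keeps the sketch usable across PAGE schema versions, at the cost of not validating the document.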
This annotated dataset is split into two subsets: Simple Documents and Complex Documents. The first includes only pages with simple page layouts and annotated text regions. Hence, it can be used for a track that evaluates the text line segmentation only, neglecting issues that arise from the page layout. The second subset, Complex Documents, includes full-page tables, multi-column text and rotated text lines. The challenge is not only to robustly detect baselines but also to split baselines correctly with respect to the page layout.
Both subsets are split into a training and a test set. For training, [math] images are taken from each collection, resulting in [math] training images for Simple Documents and [math] images for Complex Documents. The data along with the GT is publicly available [26]. Two example images are shown in Fig. 1.
III Evaluation Scheme
Since baseline detection is the first step in the information retrieval pipeline of a classical workflow, there are special requirements regarding the evaluation scheme:
- The evaluation scheme should indicate how reliably the text is detected, ignoring layout issues. The value reflecting this is called R-value, since it has similar properties as the well-known recall value.
- The evaluation scheme should indicate how reliably the structure of the text lines (layout) of the document is detected. The value reflecting this is called P-value, since it has similar properties as the well-known precision value.
- The evaluation scheme should be invariant to small differences between ground truth and hypotheses. There is no unique correct baseline; slightly different baselines potentially lead to the same HTR accuracy.
- The evaluation scheme should be able to handle skewed and oriented text lines.
- The evaluation scheme should rely neither on a reading order nor on a binarization.
To our knowledge there is no evaluation scheme meeting these requirements, or even any scheme working for baselines. Hence, we propose a newly developed scheme to evaluate the performance of baseline detection algorithms. The proposed algorithm is implemented in Java and available as a standalone command line tool. It is licensed under LGPLv3 and publicly available (https://github.com/Transkribus/TranskribusBaseLineEvaluationScheme).
III-A Single Page Evaluation
In the following the calculation of R and P for a single page is explained. Let $\mathcal{P}$ be the set of all polygonal chains (each polygonal chain represents a baseline and contains a finite number of ordered vertices, which are characterized by two coordinates). $\mathcal{G} \subset \mathcal{P}$ is the set of given (GT) polygonal chains representing the baselines for a single page and $\mathcal{H} \subset \mathcal{P}$ is the set of hypothesis (HY) polygonal chains calculated by a baseline detection algorithm for the same page, Fig. 2a. The calculation of R and P for the two sets $\mathcal{G}$ and $\mathcal{H}$ proceeds as follows:
III-A1 Polygonal Chain Normalization
In a first step each chain is normalized so that two adjacent vertices are in the [math]-neighborhood of each other (have a distance [math]), Fig. 2b. The resulting sets of normalized chains are $\tilde{\mathcal{G}}$ and $\tilde{\mathcal{H}}$. For better readability we omit the tilde: in the following, $\mathcal{G}$ and $\mathcal{H}$ denote the sets of normalized polygonal chains.
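The normalization step can be sketched as a resampling of each chain. This is a minimal sketch, assuming linear interpolation between the annotated vertices; the function name `normalize_chain` and the parameter `max_step` are hypothetical, and `max_step` stands in for the distance bound that is not reproduced in the text above.

```python
import math


def normalize_chain(chain, max_step=1.0):
    """Resample a polygonal chain so that consecutive vertices are at most
    max_step apart. Original vertices are kept; points are added between them.
    Expects a non-empty list of (x, y) tuples."""
    out = [(float(chain[0][0]), float(chain[0][1]))]
    for (x0, y0), (x1, y1) in zip(chain, chain[1:]):
        dist = math.hypot(x1 - x0, y1 - y0)
        # Number of equal sub-segments needed to respect max_step.
        n = max(1, math.ceil(dist / max_step))
        for k in range(1, n + 1):
            out.append((x0 + (x1 - x0) * k / n, y0 + (y1 - y0) * k / n))
    return out


dense = normalize_chain([(0, 0), (4, 0), (4, 3)], max_step=1.0)
print(len(dense))
```

After this step the number of vertices of a chain is roughly proportional to its length, which is what allows vertex counts to act as a length measure in the later steps.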
III-A2 Tolerance Value Calculation
In a second step a tolerance value $t_g$ is calculated for each chain $g$ of the GT set $\mathcal{G}$. As mentioned above, the evaluation scheme should not penalize HY baselines which are slightly different from the GT baselines. Hence, some kind of tolerance is necessary. Page (and text line) dependent tolerance values are calculated, because within a collection various resolutions and layout scenarios could be present. A single pre-defined tolerance value can hardly cover all these scenarios in a satisfying fashion. Since the $y$-coordinates of the vertices are typically “wrongly” oriented in computer vision scenarios, they have to be negated for the following procedure. To calculate $t_g$, the orientation $\alpha_g$ of $g$ is estimated using linear regression. $o_g$ is the vector of length 1 with orientation $\alpha_g$. Given the set $V_g$ of all vertices of the chains in $\mathcal{G} \setminus \{g\}$, the subset $V'_g \subseteq V_g$ is calculated such that for any $p \in V'_g$ there are at least two vertices $q_1, q_2 \in g$ satisfying
$$\langle p - q_1, o_g \rangle \cdot \langle p - q_2, o_g \rangle \le 0. \tag{1}$$
Condition (1) means that the projections of $p - q_1$ and $p - q_2$ into the direction of $o_g$ have different algebraic signs (or have length zero). In Fig. 2c the set of vertices $V'_g$ for GT baseline [math] is shown (green points). For each $p \in V'_g$ one vertex $q_p \in g$ is determined for which the projection of $p - q_p$ into the direction of $o_g$ has minimal length
$$q_p = \operatorname*{arg\,min}_{q \in g} \left| \langle p - q, o_g \rangle \right|.$$
The minimum distance of $g$ to another chain is calculated by
$$d_g = \min_{p \in V'_g} \left| (p - q_p)_x \, (o_g)_y - (p - q_p)_y \, (o_g)_x \right|.$$
Subscripts $x$ and $y$ denote the $x$- and $y$-coordinate of a vector. $d_g$ is the minimal length of the projections of all $p - q_p$ into the direction orthogonal to $o_g$, see Fig. 2c (green lines). For $V'_g = \emptyset$ there are no other baselines allowing a meaningful calculation of $d_g$, hence its tolerance value is set to some default value ([math] was chosen). Condition (1) is essential since $V'_g$ is the basis for the estimation of the minimal distance of $g$ to another chain. For instance the yellow vertex in Fig. 2c has a significantly shorter orthogonal projection to GT line [math], but of course would falsify the statistics. The mean $\bar{d}$ of all $d_g$ ($g \in \mathcal{G}$) with a value different from the default value is calculated. Finally, the GT baseline dependent tolerance values are calculated
[TABLE]
[math] of the estimated interline distance yields a reasonable compromise between accuracy and flexibility. $\mathcal{T} = \{ t_g : g \in \mathcal{G} \}$ is the set containing the resulting tolerance values. In Fig. 2d the blue areas show the individual tolerance areas for the different GT baselines.
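The interline distance estimate behind the tolerance values can be sketched as follows. This is a minimal sketch under stated assumptions: the orientation is taken from a principal-axis fit, swapped in for the linear regression named in the text so that vertical lines do not break the fit, and the negation of the $y$-coordinates is omitted because it does not change the resulting distances. The function names are hypothetical, and the final step from the distance to the tolerance value is not reproduced.

```python
import math


def unit_direction(chain):
    """Unit vector along the dominant direction of a chain (principal axis)."""
    n = len(chain)
    mx = sum(x for x, _ in chain) / n
    my = sum(y for _, y in chain) / n
    sxx = sum((x - mx) ** 2 for x, _ in chain)
    syy = sum((y - my) ** 2 for _, y in chain)
    sxy = sum((x - mx) * (y - my) for x, y in chain)
    angle = 0.5 * math.atan2(2.0 * sxy, sxx - syy)
    return math.cos(angle), math.sin(angle)


def interline_distance(g, others):
    """Smallest orthogonal distance from chain g to those vertices of the
    other chains that satisfy condition (1); None if no vertex qualifies."""
    ox, oy = unit_direction(g)
    best = None
    for chain in others:
        for px, py in chain:
            # Signed projections of (p - q) onto the direction of g, for all q in g.
            along = [(px - qx) * ox + (py - qy) * oy for qx, qy in g]
            if min(along) * max(along) > 0:
                continue  # p lies beyond both ends of g: condition (1) fails
            # Vertex of g whose projection along the line direction is shortest.
            k = min(range(len(g)), key=lambda i: abs(along[i]))
            qx, qy = g[k]
            # Length of the projection of (p - q) orthogonal to the direction.
            d = abs((px - qx) * oy - (py - qy) * ox)
            best = d if best is None else min(best, d)
    return best


# Three horizontal lines at y = 0, 10 and 30 (illustrative data).
lines = [[(float(x), float(y)) for x in range(0, 101, 10)] for y in (0, 10, 30)]
dists = [interline_distance(g, lines[:i] + lines[i + 1:]) for i, g in enumerate(lines)]
print(dists)
```

A vertex that lies entirely to the side of a line, such as a marginal note next to the body text, is skipped by the sign test, so it cannot shrink the estimated interline distance.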
III-A3 Coverage Function
Employing the (tolerance dependent) function $\mathrm{COV}(p, q, t)$ implemented via Alg. 1, one can determine a value representing the fraction of chain $p$ for which there is a vertex of chain $q$ within a certain tolerance area (skew-invariant).
Alg. 1 counts the number of vertices of $p$ for which there is a vertex of $q$ with a distance less than the given tolerance value $t$. Furthermore, a smooth (linear) transition is performed for vertices with a distance between $t$ and [math]. A vertex with a distance less than $t$ counts 1, with a distance of [math] it counts [math], with a distance of [math] it counts [math], … Finally, a vertex with a distance of [math] and more counts 0. The resulting value is normalized using the number of vertices of $p$.
Let $\mathrm{COV}$ also denote the generic extension of COV to a function accepting sets of polygonal chains as second argument. The minimum from line [math] in Alg. 1 is then calculated over a set of chains instead of a single chain. To clarify the functionality of the coverage functions a few exemplary values are shown in Tab. I. In particular, the function COV is not commutative in the first two arguments.
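The coverage function can be sketched in a few lines. This is a minimal sketch, not the published Alg. 1: the name `cov`, the choice to always pass the second argument as a list of chains, and the parameter `outer_factor` are assumptions. In particular, the default `outer_factor=3.0`, which sets where the linear transition reaches zero, is a placeholder and is not taken from the text.

```python
import math


def cov(p, chains, t, outer_factor=3.0):
    """Fraction of the vertices of chain p that have a vertex of any chain in
    `chains` nearby. Distances up to t count fully; the count then falls
    linearly and reaches zero at outer_factor * t (assumed placeholder value)."""
    upper = outer_factor * t
    count = 0.0
    for px, py in p:
        d_min = min(math.hypot(px - qx, py - qy)
                    for chain in chains for qx, qy in chain)
        if d_min <= t:
            count += 1.0
        elif d_min < upper:
            count += (upper - d_min) / (upper - t)
    return count / len(p)


short = [(0.0, 0.0), (1.0, 0.0)]
long_ = [(float(x), 0.0) for x in range(21)]

print(cov(short, [long_], t=1.0))  # the short chain is fully covered by the long one
print(cov(long_, [short], t=1.0))  # the long chain is only partly covered
```

The two calls give different values for the same pair of chains, which illustrates why the order of the first two arguments matters.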
III-A4 R and P Calculation
The tolerance dependent R value of the GT set $\mathcal{G}$ and the HY set $\mathcal{H}$, with $t_g$ denoting the tolerance value of GT chain $g$, is finally calculated by
$$R(\mathcal{G}, \mathcal{H}) = \frac{\sum_{g \in \mathcal{G}} \mathrm{COV}(g, \mathcal{H}, t_g)}{|\mathcal{G}|}. \tag{2}$$
The R value indicates for what fraction of the GT baselines there are detected HY baselines within a certain tolerance area. Segmentation (page layout) errors are not penalized at all, because no alignment between GT and HY baselines is enforced.
These segmentation errors are penalized in the P value. Let $M \subset \mathcal{G} \times \mathcal{H}$ be an alignment of GT and HY chains where each element of $\mathcal{G}$ as well as of $\mathcal{H}$ occurs at most once. The tolerance dependent P value of $\mathcal{G}$ and $\mathcal{H}$ is calculated as follows
$$P(\mathcal{G}, \mathcal{H}) = \frac{\sum_{(g, h) \in M} \mathrm{COV}(h, g, t_g)}{|\mathcal{H}|}. \tag{3}$$
An alignment ensures that segmentation errors are penalized. E.g. if a text line is split into two equally sized parts, an R value of 1 is calculated (the two detected chains cover the entire GT chain), but the expected P value is 0.5 (the GT chain is aligned with exactly one of the HY chains with a P value of 1, and this is divided by 2, because there are two HY chains). We want to mention that in both cases (R and P) short text lines have the same impact as long ones, because in (2) and (3) the line specific R and P values are divided by the number of GT and HY lines, respectively. This prevents the proposed evaluation scheme from underestimating the importance of short text lines, which often contain essential information in the context of historical documents, e.g. dates.
III-A5 Greedy-based Alignment
To evaluate (3), a P-optimal alignment is necessary. Therefore a P matrix is calculated with elements $P_{ij} = \mathrm{COV}(h_j, g_i, t_{g_i})$ for the GT chains $g_i$ and the HY chains $h_j$. Based on this, the alignment is calculated in a greedy manner, see Alg. 2. A greedy approach was chosen because there is no reading order available (so no dynamic programming is possible) and the greedy solution is in most practical cases the exact solution.
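The interplay of R, P and the greedy alignment can be sketched for a single page. This is a minimal sketch and not the published Alg. 2: it assumes that "greedy" means repeatedly taking the largest remaining entry of the P matrix, and it uses a coverage function whose transition bound `outer_factor=3.0` is a placeholder. All function names are hypothetical, and non-empty GT and HY lists are assumed.

```python
import math


def cov(p, chains, t, outer_factor=3.0):
    """Coverage of chain p by a list of chains (outer_factor is an assumed value)."""
    upper = outer_factor * t
    count = 0.0
    for px, py in p:
        d = min(math.hypot(px - qx, py - qy) for c in chains for qx, qy in c)
        if d <= t:
            count += 1.0
        elif d < upper:
            count += (upper - d) / (upper - t)
    return count / len(p)


def r_value(gt, hy, tol):
    """Mean coverage of each GT chain by all HY chains; no alignment involved."""
    return sum(cov(g, hy, t) for g, t in zip(gt, tol)) / len(gt)


def greedy_alignment(gt, hy, tol):
    """Pick the largest remaining P-matrix entry until rows or columns run out."""
    entries = sorted(
        ((cov(h, [g], t), i, j)
         for i, (g, t) in enumerate(zip(gt, tol))
         for j, h in enumerate(hy)),
        reverse=True,
    )
    used_gt, used_hy, pairs = set(), set(), []
    for score, i, j in entries:
        if score > 0 and i not in used_gt and j not in used_hy:
            pairs.append((i, j, score))
            used_gt.add(i)
            used_hy.add(j)
    return pairs


def p_value(gt, hy, tol):
    """Sum of aligned coverages divided by the number of HY chains."""
    return sum(s for _, _, s in greedy_alignment(gt, hy, tol)) / len(hy)


# One GT line detected as two equally sized halves (illustrative data).
gt = [[(float(x), 0.0) for x in range(101)]]
hy = [gt[0][:51], gt[0][51:]]
print(r_value(gt, hy, [5.0]), p_value(gt, hy, [5.0]))
```

In this split-line case the two halves together cover the GT line, but only one of them can be aligned to it, so the unaligned half contributes nothing to P.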
III-A6 Harmonic Mean (F value)
Finally, the harmonic mean of R and P, which we call the F value,
$$F = \frac{2 \cdot R \cdot P}{R + P},$$
is calculated.
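As a small worked example, the harmonic mean can be evaluated for a text line that is split into two equal halves, for which R is 1 and P is 0.5. The function name `f_value` and the convention of returning 0 when both inputs are 0 are assumptions made for this sketch.

```python
def f_value(r, p):
    """Harmonic mean of R and P; defined as 0 when both are 0 (assumed convention)."""
    if r + p == 0:
        return 0.0
    return 2.0 * r * p / (r + p)


print(f_value(1.0, 0.5))
```

The harmonic mean lies closer to the smaller of the two values, so a result with perfect R but poor P is not rewarded as much as it would be by the arithmetic mean.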
III-B Multi Page Evaluation
Since the dataset is very heterogeneous, each page is evaluated on its own. The average is calculated over these page-wise results. This prevents pages with dozens of baselines (like pages containing a table) from dominating the result and yields results representing the robustness of the evaluated algorithms over various scenarios.
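The effect of page-wise averaging can be illustrated with made-up numbers. This is a minimal sketch: the per-page values and line counts below are hypothetical, and the line-weighted variant is shown only as a contrast, not as part of the scheme.

```python
def page_average(values):
    """Every page contributes equally, regardless of how many lines it has."""
    return sum(values) / len(values)


def line_weighted_average(values, line_counts):
    """Contrast only: pages with many lines dominate the result."""
    total = sum(v * n for v, n in zip(values, line_counts))
    return total / sum(line_counts)


# Hypothetical pages: a table page with 90 lines and a letter with 10 lines.
page_r = [1.0, 0.5]
gt_lines = [90, 10]

print(page_average(page_r))
print(line_weighted_average(page_r, gt_lines))
```

With page-wise averaging the weak result on the short page pulls the score down to 0.75, whereas weighting by line count would hide it behind the table page.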
III-C Examples
Results for different subsets of the GT and HY baselines of Fig. 2a are shown in Tab. II and explained in the following.
The small difference between Ex. and Ex. is due to the fact, that in both cases is aligned to for the P calculation. Hence, there is no effect on P if is removed. R is nearly the same, because and are both completely covered by . By removing instead of (Ex. ), is now aligned to yielding a lower P value, because covers much more of than . In Ex. one gets a high P value, because the remaining HY baselines are very well covered by the GT baselines. By adding (Ex. ) we of course increase R, but decrease P. This is due to the fact that is aligned to (as in Ex. ) and is not aligned at all and gets a P value of [math].
IV Baseline System
In this section we present the results obtained by applying the text line detection algorithm presented in [27]. This approach relies on the clustering of so-called superpixels (SPs). These SPs were calculated utilizing the classical FAST algorithm. The algorithm does not rely on any training process. Hence, the training subset was ignored and the algorithm was applied only to the test subset (without any parameter tuning). The results obtained are shown in Tab. III.
As mentioned in [27], the method struggles when faced with complex layouts. It suffers from undersegmentation problems, which results in a considerably lower accuracy on the complex track than on the simple track.
V Conclusion
A new dataset consisting of 2036 pages of archival documents with annotated baselines was introduced. A wide span of different times as well as locations is covered. The dataset contains documents with various degradations and complex layouts. Along with the dataset a goal-oriented evaluation scheme based on baseline representations is introduced. Finally, the results obtained by a baseline system are shown. This work provides new challenges as well as a solid basis for competitive evaluations for the document layout community.
Acknowledgment
This work was partially funded by the European Union’s Horizon research and innovation programme under grant agreement No (READ – Recognition and Enrichment of Archival Documents).
- [1] B. Gatos, N. Stamatopoulos, and G. Louloudis, “ICDAR 2009 handwriting segmentation contest,” International Journal on Document Analysis and Recognition, vol. 14, no. 1, pp. 25–33, 2011.
- [2] A. Antonacopoulos, S. Pletschacher, D. Bridson, and C. Papadopoulos, “ICDAR 2009 page segmentation competition,” in Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, 2009, pp. 1370–1374.
- [3] B. Gatos, N. Stamatopoulos, and G. Louloudis, “ICFHR 2010 handwriting segmentation contest,” in Proceedings - 12th International Conference on Frontiers in Handwriting Recognition, ICFHR 2010, 2010, pp. 737–742.
- [4] A. Antonacopoulos, C. Clausner, C. Papadopoulos, and S. Pletschacher, “Historical document layout analysis competition,” in Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, 2011, pp. 1516–1520.
- [5] N. Stamatopoulos, B. Gatos, G. Louloudis, U. Pal, and A. Alaei, “ICDAR 2013 handwriting segmentation contest,” in Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, 2013, pp. 1402–1406.
- [6] M. Murdock, S. Reid, B. Hamilton, and J. Reese, “ICDAR 2015 competition on text line detection in historical documents,” in Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, vol. 2015-November. IEEE, Aug. 2015, pp. 1171–1175.
- [7] A. Antonacopoulos, C. Clausner, C. Papadopoulos, and S. Pletschacher, “ICDAR 2013 competition on historical book recognition (HBR 2013),” in Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, 2013, pp. 1459–1463.
- [8] A. Antonacopoulos, C. Clausner, C. Papadopoulos, and S. Pletschacher, “ICDAR 2015 competition on recognition of documents with complex layouts - RDCL 2015,” in Proceedings of the International Conference on Document Analysis and Recognition, ICDAR, vol. 2015-Novem, 2015, pp. 1151–1155.
