Lifelong learning for text retrieval and recognition in historical handwritten document collections
Lambert Schomaker

TL;DR
This paper discusses the development of a lifelong learning system for text retrieval and recognition in large, diverse collections of historical handwritten documents, emphasizing scalability and evolving ground truth.
Contribution
It introduces the 'ball-park principle' to guide the transition from traditional to deep learning methods as the amount of labeled data grows.
Findings
Deep learning offers high potential but requires scalable data labeling.
The 'ball-park principle' helps manage the evolution of learning approaches.
The system addresses variability across scripts and languages in historical documents.
Abstract
This chapter provides an overview of the problems that need to be dealt with when constructing the Monk system, a lifelong-learning retrieval, recognition and indexing engine for large historical document collections in multiple scripts and languages. This application is highly variable over time, since the continuous labeling by end users changes the concept of what constitutes a 'ground truth'. Although current advances in deep learning provide huge potential in this application domain, the scale of the problem, i.e., more than 520 hugely diverse books, documents and manuscripts, precludes the meticulous and painstaking human effort that is currently required in designing and developing successful deep-learning systems. The ball-park principle is introduced, which describes the evolution from the sparsely-labeled stage that can only be addressed by traditional methods or…
Taxonomy
Topics: Handwritten Text Recognition Techniques
