Predicting the Ordering of Characters in Japanese Historical Documents
Alex Lamb, Tarin Clanuwat, Siyu Han, Mikel Bober-Irizar, Asanobu Kitamoto

TL;DR
This paper presents rule-based and deep learning methods for predicting the reading order of characters in Japanese historical documents, a step needed for readability and for downstream NLP tasks. The deep learning model reaches 98.65% accuracy.
Contribution
It introduces novel approaches, including rule-based and deep learning models, for predicting character order in classical Japanese texts, addressing a gap in existing research.
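To make the task concrete, here is a minimal sketch of a naive ordering rule. It is not the paper's method. It assumes only the traditional layout convention that text runs in vertical columns, read top to bottom, with columns ordered right to left. The function name `naive_reading_order`, the `column_gap` threshold, and the input format (character center coordinates in pixels) are all hypothetical choices made for illustration.

```python
from typing import List, Tuple

# (x_center, y_center) in image pixels; y grows downward. Hypothetical input format.
Point = Tuple[float, float]


def naive_reading_order(centers: List[Point], column_gap: float) -> List[int]:
    """Return indices of `centers` in a naive right-to-left, top-to-bottom order.

    Illustrative baseline only, not the rule-based method from the paper.
    `column_gap` is an assumed threshold: a horizontal jump larger than this
    between consecutive characters (sorted by x) starts a new column.
    """
    # Visit characters from the rightmost x to the leftmost.
    by_x = sorted(range(len(centers)), key=lambda i: -centers[i][0])

    columns: List[List[int]] = []
    for i in by_x:
        if columns and centers[columns[-1][-1]][0] - centers[i][0] <= column_gap:
            # Close enough in x to the previous character: same column.
            columns[-1].append(i)
        else:
            # Large horizontal jump (or first character): start a new column.
            columns.append([i])

    # Within each column, read top to bottom.
    order: List[int] = []
    for col in columns:
        order.extend(sorted(col, key=lambda i: centers[i][1]))
    return order


if __name__ == "__main__":
    # Two columns: one near x=100 (read first), one near x=50.
    pts = [(100, 10), (100, 50), (50, 20), (52, 5), (98, 30)]
    print(naive_reading_order(pts, column_gap=20))
```

A fixed rule like this is easily confused by irregular layouts (annotations, varying character sizes, text wrapped around illustrations), which is the kind of difficulty that motivates a learned ordering model.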
Findings
Deep learning model achieved 98.65% accuracy
Nearly half of the books had perfect sequence prediction
Rule-based methods offer trade-offs in accuracy and complexity
Abstract
Japan is a unique country with a distinct cultural heritage, which is reflected in billions of historical documents that have been preserved. However, the change in the Japanese writing system in 1900 made these documents inaccessible to the general public. A major research project has been to make these historical documents accessible and understandable. An increasing amount of research has focused on the character recognition task and the location of characters on the image, yet less research has focused on how to predict the sequential ordering of the characters. This is because sequence in classical Japanese is very different from modern Japanese. Ordering characters into a sequence is important for making the document text easily readable and searchable. Additionally, it is a necessary step for any kind of natural language processing on the data (e.g. machine translation, language…
Taxonomy
Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Topic Modeling
