Predicting the Ordering of Characters in Japanese Historical Documents

Alex Lamb; Tarin Clanuwat; Siyu Han; Mikel Bober-Irizar; Asanobu Kitamoto

arXiv:2106.06786 · cs.CL · June 15, 2021

TL;DR

This paper presents methods for predicting the sequential order of characters in Japanese historical documents, a step that is crucial for readability and NLP tasks, and reports high accuracy using deep learning and rule-based approaches.

Contribution

The paper introduces novel rule-based and deep learning approaches for predicting character order in classical Japanese texts, addressing a gap in existing research.

Findings

01. Deep learning model achieved 98.65% accuracy

02. Nearly half of the books had perfect sequence prediction

03. Rule-based methods offer trade-offs in accuracy and complexity
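
To make the idea of a rule-based ordering concrete, here is a minimal sketch of one possible heuristic. It is not the paper's method: it only assumes the conventional layout of classical Japanese text (vertical columns read top to bottom, with columns ordered right to left) and that character bounding-box centers are already available. The `CharBox` type, the `order_characters` function, and the `column_width` parameter are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class CharBox:
    """Hypothetical container for one detected character."""
    char: str
    x: float  # horizontal center of the bounding box, in pixels
    y: float  # vertical center of the bounding box, in pixels


def order_characters(boxes, column_width):
    """Assumed heuristic: group boxes into vertical columns, then read
    columns right to left and characters top to bottom within each column."""
    columns = []  # each entry is the list of boxes in one column
    # Visit boxes from the right side of the page to the left.
    for box in sorted(boxes, key=lambda b: -b.x):
        # Join the current column if the box is horizontally close to the
        # first box placed in that column; otherwise start a new column.
        if columns and abs(columns[-1][0].x - box.x) <= column_width / 2:
            columns[-1].append(box)
        else:
            columns.append([box])
    ordered = []
    for column in columns:
        # Within a column, read from top (small y) to bottom (large y).
        ordered.extend(sorted(column, key=lambda b: b.y))
    return ordered


page = [
    CharBox("B", 102, 50),
    CharBox("C", 40, 12),
    CharBox("A", 100, 10),
    CharBox("D", 38, 60),
]
print("".join(b.char for b in order_characters(page, column_width=30)))  # prints ABCD
```

A fixed `column_width` keeps the sketch short, but it is also the kind of simplification that would likely fail on pages with irregular layouts, which illustrates why simple rules trade accuracy for low complexity.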

Abstract

Japan is a unique country with a distinct cultural heritage, which is reflected in billions of historical documents that have been preserved. However, the change in the Japanese writing system in 1900 made these documents inaccessible to the general public. A major research project has been to make these historical documents accessible and understandable. An increasing amount of research has focused on the character recognition task and the location of characters in the image, yet less research has focused on how to predict the sequential ordering of the characters. This is because the sequence in classical Japanese is very different from that in modern Japanese. Ordering characters into a sequence is important for making the document text easily readable and searchable. Additionally, it is a necessary step for any kind of natural language processing on the data (e.g. machine translation, language…

Taxonomy

Topics: Handwritten Text Recognition Techniques · Natural Language Processing Techniques · Topic Modeling