Beyond Arabic: Software for Perso-Arabic Script Manipulation
Alexander Gutkin, Cibu Johny, Raiomond Doctor, Brian Roark, Richard Sproat

TL;DR
This paper introduces an open-source FST-based library for normalization, regional orthography transformation, romanization, and transliteration of the Perso-Arabic script across multiple languages, with normalization that goes beyond the standard Unicode normalization forms.
Contribution
It provides a comprehensive, formalized toolkit for Perso-Arabic script processing, including normalization, regional orthography transformations, and romanization, applicable to diverse languages.
Findings
Library supports script normalization beyond Unicode forms
Includes regional orthography transformations for eleven languages
Offers FST-based romanization and transliteration tools
Abstract
This paper presents an open-source software library that provides a set of finite-state transducer (FST) components and corresponding utilities for manipulating the writing systems of languages that use the Perso-Arabic script. The operations include various levels of script normalization, including visual invariance-preserving operations that subsume and go beyond the standard Unicode normalization forms, as well as transformations that modify the visual appearance of characters in accordance with the regional orthographies for eleven contemporary languages from diverse language families. The library also provides simple FST-based romanization and transliteration. We additionally attempt to formalize the typology of Perso-Arabic characters by providing one-to-many mappings from Unicode code points to the languages that use them. While our work focuses on the Arabic script diaspora…
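To make the distinction between standard Unicode normalization and the regional orthography transformations concrete, here is a minimal sketch in plain Python. It is not the library's API and does not use FSTs: the function names and the two-entry `PERSIAN_VISUAL_MAP` table are hypothetical, chosen only for illustration, and a simple character translation table stands in for the transducers the paper describes. It assumes the common Persian convention of writing keheh and Farsi yeh where Arabic uses kaf and yeh.

```python
import unicodedata

# Hypothetical, hand-picked mapping for Persian orthography.
# Illustration only: these entries are an assumption, not the library's tables.
PERSIAN_VISUAL_MAP = str.maketrans({
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
})


def normalize_nfc(text: str) -> str:
    """Apply only the standard Unicode NFC normalization form."""
    return unicodedata.normalize("NFC", text)


def normalize_persian(text: str) -> str:
    """Apply NFC first, then the illustrative regional character mapping."""
    return normalize_nfc(text).translate(PERSIAN_VISUAL_MAP)


# NFC composes alef + combining maddah above into the precomposed letter.
decomposed = "\u0627\u0653"
print(normalize_nfc(decomposed) == "\u0622")

# NFC leaves Arabic kaf and yeh untouched; the regional step rewrites them.
arabic_spelling = "\u0643\u062A\u0627\u0628\u064A"
print(normalize_nfc(arabic_spelling) == arabic_spelling)
print(normalize_persian(arabic_spelling) == "\u06A9\u062A\u0627\u0628\u06CC")
```

The point of the sketch is the layering: canonical composition is handled by NFC, while language-specific letter choices need an extra, per-language mapping that Unicode normalization does not provide.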
Taxonomy
Topics: Natural Language Processing Techniques · Language, Linguistics, Cultural Analysis · Handwritten Text Recognition Techniques
Methods: Lib
