GlotScript: A Resource and Tool for Low Resource Writing System Identification
Amir Hossein Kargaran, François Yvon, Hinrich Schütze

TL;DR
GlotScript is a comprehensive resource and tool for identifying writing systems in low-resource languages, aiding NLP tasks and analysis of language model coverage.
Contribution
It introduces GlotScript-R, a resource listing the attested writing systems of more than 7,000 languages, and GlotScript-T, a writing system identification tool covering all 161 Unicode 15.0 scripts, and demonstrates two practical use cases.
Findings
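GlotScript can help clean multilingual corpora such as mC4 and OSCAR.
Analyzing tokenizers with GlotScript gives insight into how well language models such as GPT-4 cover low-resource scripts and languages.
The resource is intended to support NLP research on low-resource languages.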
Abstract
We present GlotScript, an open resource and tool for low resource writing system identification. GlotScript-R is a resource that provides the attested writing systems for more than 7,000 languages. It is compiled by aggregating information from existing writing system resources. GlotScript-T is a writing system identification tool that covers all 161 Unicode 15.0 scripts. For an input text, it returns its script distribution where scripts are identified by ISO 15924 codes. We also present two use cases for GlotScript. First, we demonstrate that GlotScript can help clean multilingual corpora such as mC4 and OSCAR. Second, we analyze the tokenization of a number of language models such as GPT-4 using GlotScript and provide insights on the coverage of low resource scripts and languages by each language model. We hope that GlotScript will become a useful resource for work on low resource…
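To make the idea of a "script distribution" concrete, here is a minimal Python sketch. It is not GlotScript-T and does not use its API: the `SCRIPT_RANGES` table, the function names, and the `threshold` parameter are all assumptions introduced for illustration. The table covers only a handful of scripts with approximate code point ranges, whereas the real tool covers all 161 Unicode 15.0 scripts. The `keep_line` helper shows one plausible way a corpus-cleaning filter could use such a distribution together with a set of expected scripts for a language (the kind of information GlotScript-R provides).

```python
from collections import Counter

# Hypothetical, hand-picked code point ranges keyed by ISO 15924 code.
# Illustrative subset with approximate boundaries, not the full Unicode
# script data that a real tool would rely on.
SCRIPT_RANGES = {
    "Latn": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x00D6),
             (0x00D8, 0x00F6), (0x00F8, 0x024F)],
    "Grek": [(0x0370, 0x03FF)],
    "Cyrl": [(0x0400, 0x04FF)],
    "Arab": [(0x0600, 0x06FF)],
    "Deva": [(0x0900, 0x097F)],
    "Hira": [(0x3041, 0x3096)],
    "Kana": [(0x30A1, 0x30FA)],
    "Hani": [(0x4E00, 0x9FFF)],
    "Hang": [(0xAC00, 0xD7A3)],
}


def script_of(ch):
    """Return the ISO 15924 code for a character, or None if it is not in the table."""
    cp = ord(ch)
    for code, ranges in SCRIPT_RANGES.items():
        if any(lo <= cp <= hi for lo, hi in ranges):
            return code
    return None


def script_distribution(text):
    """Map each detected script to its share of the script-bearing characters.

    Characters outside the table (spaces, digits, punctuation, unknown
    scripts) are ignored, so the returned shares sum to 1 when non-empty.
    """
    counts = Counter(s for s in map(script_of, text) if s is not None)
    total = sum(counts.values())
    return {code: n / total for code, n in counts.items()} if total else {}


def keep_line(line, expected_scripts, threshold=0.5):
    """Keep a line only if its dominant script is expected and frequent enough."""
    dist = script_distribution(line)
    if not dist:
        return False
    main = max(dist, key=dist.get)
    return main in expected_scripts and dist[main] >= threshold
```

With this sketch, `script_distribution("Hello мир")` assigns 0.625 to `Latn` and 0.375 to `Cyrl`, and `keep_line("Hello world", {"Cyrl"})` rejects the line because its dominant script is not among the expected ones. The 0.5 threshold is an arbitrary choice for the example, not a value taken from the paper.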
Taxonomy
Topics: Natural Language Processing Techniques
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Dropout · Adam · Layer Normalization · Label Smoothing · Byte Pair Encoding · Absolute Position Encodings · Dense Connections
