TL;DR
This paper introduces a large, verified parallel corpus for Tajik-Persian transliteration and trains a character-level Transformer model that outperforms previous baselines, with all resources released for reproducibility.
Contribution
The study provides one of the largest publicly available Tajik-Persian transliteration corpora and a high-performing Transformer model, along with comprehensive data and code release.
Findings
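The character-level Transformer achieves a Character Error Rate (CER) of 0.3216 and an exact-match accuracy of 0.3133, outperforming the rule-based and recurrent neural baselines.
Beam search (k=3) improves these results to a CER of 0.3182 and an accuracy of 0.3215.
The corpus and scripts are released for further research.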
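To illustrate why beam search can beat greedy decoding, here is a minimal sketch in Python. It is not the paper's decoder: `beam_search`, `toy_step`, the `TABLE` probabilities, and the `<s>`/`</s>` markers are all hypothetical, and a real system would take next-character probabilities from the trained Transformer rather than a lookup table.

```python
import math

# Hypothetical next-symbol probabilities, keyed by the last symbol emitted.
# In a real transliteration system these would come from the trained model.
TABLE = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a": {"x": 0.5, "y": 0.5},
    "b": {"z": 0.9, "</s>": 0.1},
    "x": {"</s>": 1.0},
    "y": {"</s>": 1.0},
    "z": {"</s>": 1.0},
}


def toy_step(seq):
    """Return a dict of next-symbol probabilities for a partial sequence."""
    return TABLE[seq[-1]]


def beam_search(step_fn, start="<s>", end="</s>", k=3, max_len=10):
    """Keep the k best partial hypotheses (by summed log-probability) per step."""
    beams = [([start], 0.0)]
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for sym, p in step_fn(seq).items():
                candidates.append((seq + [sym], score + math.log(p)))
        # Prune to the k highest-scoring candidates.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:k]:
            if seq[-1] == end:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.extend(beams)  # fall back to unfinished hypotheses if any remain
    return max(finished, key=lambda c: c[1])


greedy_seq, greedy_score = beam_search(toy_step, k=1)  # k=1 is greedy decoding
beam_seq, beam_score = beam_search(toy_step, k=3)
print(greedy_seq, math.exp(greedy_score))
print(beam_seq, math.exp(beam_score))
```

In this toy table, greedy decoding commits to `a` because it is the most likely first symbol, while the wider beam keeps `b` alive long enough to find a sequence with higher overall probability. The sketch omits length normalization, which practical decoders often add.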
Abstract
This study addresses automatic transliteration from Tajik (Cyrillic script) to Persian (Perso-Arabic script). We present a curated, lexicographically verified parallel corpus of 52,152 Tajik-Persian words and short phrases, compiled from printed dictionaries, encyclopedic sources, and manually verified online resources. To the best of our knowledge, this is one of the largest publicly available word-level corpora for Tajik-Persian transliteration. Using this corpus, we train a character-level sequence-to-sequence Transformer model and evaluate it using Character Error Rate (CER) and exact-match accuracy. The Transformer achieves a CER of 0.3216 and an exact-match accuracy of 0.3133, outperforming both rule-based (dictionary-based) and recurrent neural baselines. With beam search (k=3), performance further improves to CER 0.3182 and accuracy 0.3215. We describe the data collection and…
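The two evaluation metrics named in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released evaluation script: the function names are hypothetical, and CER is computed here as total Levenshtein edits divided by total reference characters (a corpus-level average), whereas the paper may instead average per-item rates.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def corpus_cer(refs, hyps):
    """Assumed definition: total character edits / total reference characters."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total_chars = sum(len(r) for r in refs)
    return total_edits / total_chars


def exact_match_accuracy(refs, hyps):
    """Fraction of hypotheses identical to their reference."""
    return sum(r == h for r, h in zip(refs, hyps)) / len(refs)


# Illustrative Persian-script strings, not drawn from the paper's corpus.
refs = ["کتاب", "دوست"]
hyps = ["کتاب", "دوس"]  # second hypothesis is missing its final letter
print(corpus_cer(refs, hyps))            # 1 edit over 8 reference characters
print(exact_match_accuracy(refs, hyps))  # 1 of 2 items matches exactly
```

Because exact match gives no credit for a near miss, a single wrong character fails the whole item, which is why the two metrics are reported side by side.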