A Systematic Benchmark of Machine Transliteration Models for the Tajik-Farsi Language Pair: A Comparative Study from Rule-Based to Transformer Architectures
Mullosharaf K. Arabov

TL;DR
This study systematically compares transliteration models for the Tajik-Farsi pair, from a rule-based baseline to Transformer architectures, and finds that byte-level models such as ByT5 outperform subword-based models.
Contribution
First comprehensive benchmark of modern transliteration models for Tajik-Farsi, accompanied by a new parallel corpus and an evaluation of architectures ranging from a rule-based baseline to Transformers.
Findings
ByT5 achieves the highest accuracy with chrF++ scores of 87.4 and 80.1.
The G2P Transformer, trained from scratch, outperforms the pre-trained mBART despite limited data.
Models that rely on subword tokenization, such as mT5, perform poorly on this task.
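To make the byte-level versus subword distinction concrete, the following minimal sketch shows what a byte-level model sees as input: the raw UTF-8 bytes of the text, with no learned vocabulary. The function names and the `offset` parameter are illustrative assumptions, not taken from the paper; real byte-level tokenizers typically shift byte values by a small offset to reserve ids for special tokens.

```python
def byte_ids(text: str, offset: int = 0) -> list[int]:
    """Encode text as a sequence of UTF-8 byte values.

    `offset` is a hypothetical shift that leaves room for special-token ids.
    """
    return [b + offset for b in text.encode("utf-8")]


def decode_byte_ids(ids: list[int], offset: int = 0) -> str:
    """Invert byte_ids: turn shifted byte values back into text."""
    return bytes(i - offset for i in ids).decode("utf-8")


tajik = "китоб"    # "book" in Tajik (Cyrillic script)
persian = "کتاب"   # "book" in Persian (Arabic script)

for word in (tajik, persian):
    ids = byte_ids(word)
    # Cyrillic and Arabic-script letters each take two bytes in UTF-8,
    # so the byte sequence is twice as long as the character sequence.
    print(word, len(word), "chars ->", len(ids), "bytes")
    assert decode_byte_ids(ids) == word
```

Because every string in either script maps onto the same 256 byte values, there are no out-of-vocabulary symbols. The cost is longer sequences, since each letter here takes two bytes.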
Abstract
This paper presents the first comprehensive comparative analysis of modern machine learning architectures for transliteration between Tajik (Cyrillic script) and Persian (Arabic script). A key contribution is the creation and validation of a unique parallel corpus aggregated from multiple heterogeneous sources, including crowdsourced projects, lexicographic pairs, parallel texts of "Shahnameh", diplomatic articles, texts of "Masnavi-i Ma'navi", official terminology lists, and transliterated correspondences. The initial dataset comprised 328,253 sentence pairs; a representative subset of 40,000 pairs was formed using stratified random sampling. The experiment compared six classes of models: rule-based baseline, LSTM with attention, character-level Transformer, G2P Transformer (trained from scratch), pre-trained multilingual models (mBART, mT5 with LoRA), and byte-level ByT5. Results…
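The subset-selection step mentioned in the abstract can be sketched as proportional stratified sampling. This is only an illustration under stated assumptions: the abstract does not say which variable defines the strata, so stratifying by source is a guess, and the `stratified_sample` function, its `key` argument, and the largest-remainder rounding are all hypothetical choices rather than the author's procedure.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, n, seed=0):
    """Draw n items so that each stratum keeps its share of the full set.

    `key` maps an item to its stratum label (assumed here: the source name).
    """
    if n > len(items):
        raise ValueError("sample size exceeds population size")
    rng = random.Random(seed)

    # Group items by stratum.
    strata = defaultdict(list)
    for item in items:
        strata[key(item)].append(item)

    # Proportional quotas, rounded with the largest-remainder method
    # so that the quotas sum exactly to n.
    exact = {k: n * len(v) / len(items) for k, v in strata.items()}
    quota = {k: int(e) for k, e in exact.items()}
    leftover = n - sum(quota.values())
    by_remainder = sorted(exact, key=lambda k: exact[k] - quota[k], reverse=True)
    for k in by_remainder[:leftover]:
        quota[k] += 1

    # Sample without replacement inside each stratum.
    sample = []
    for k, members in strata.items():
        sample.extend(rng.sample(members, quota[k]))
    return sample
```

For example, calling `stratified_sample(pairs, key=lambda p: p["source"], n=40000)` on a list of dictionaries with a hypothetical `"source"` field would keep each source's proportion of the full corpus in the subset.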
