TajPersLexon: A Tajik-Persian Lexical Resource and Hybrid Model for Cross-Script Low-Resource NLP

Mullosharaf K. Arabov

arXiv:2605.06886·cs.CL·May 11, 2026

TL;DR

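This paper introduces TajPersLexon, a Tajik-Persian lexical resource of 40,112 word and short-phrase pairs, and benchmarks hybrid, neural, and retrieval models on cross-script lexical tasks in a CPU-only, low-resource setting. The strongest baselines reach 98-99% top-1 accuracy, and the hybrid model remains efficient enough for practical use.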

Contribution

The creation of TajPersLexon and a comprehensive benchmark comparing hybrid, neural, and retrieval methods for cross-script NLP in low-resource contexts.

Findings

01 Neural and retrieval models achieve 98-99% top-1 accuracy on lexical matching.

02 The hybrid model provides a good accuracy-efficiency balance for OCR post-correction.

03 Large multilingual transformers underperform on exact lexical matching tasks.

Abstract

This work introduces TajPersLexon, a curated Tajik-Persian parallel lexical resource of 40,112 word and short-phrase pairs for cross-script lexical retrieval, transliteration, and alignment in low-resource settings. We conduct a comprehensive CPU-only benchmark comparing three methodological families: (i) a lightweight hybrid pipeline, (ii) neural sequence-to-sequence models, and (iii) retrieval methods. Our evaluation establishes that the task is essentially solvable, with neural and retrieval baselines achieving 98-99% top-1 accuracy. Crucially, we demonstrate that while large multilingual sentence transformers fail at this exact lexical matching task, our interpretable hybrid model offers a favorable accuracy-efficiency trade-off for practical applications, achieving 96.4% accuracy in an OCR post-correction task. All experiments use fixed random seeds for full reproducibility. The…
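The abstract reports results as top-1 accuracy, so it may help to see how that metric is typically computed for a lexical matching task. The following is a minimal sketch only, not the paper's evaluation code: the function name `top1_accuracy`, the dictionary layout, and the four toy word pairs are all assumptions made for illustration. Each Tajik (Cyrillic) query maps to a ranked list of Persian candidates produced by some model, and a query counts as correct only when the first candidate equals the gold target.

```python
from typing import Dict, List


def top1_accuracy(ranked: Dict[str, List[str]], gold: Dict[str, str]) -> float:
    """Return the fraction of queries whose top-ranked candidate is the gold target.

    `ranked` maps each query to a model's candidate list, best first.
    `gold` maps each query to its single correct target.
    Both structures are hypothetical; the paper's data format is not given here.
    """
    if not gold:
        raise ValueError("gold must contain at least one query")
    hits = 0
    for query, target in gold.items():
        candidates = ranked.get(query) or []  # missing or empty list counts as a miss
        if candidates and candidates[0] == target:
            hits += 1
    return hits / len(gold)


# Toy illustration with four common Tajik-Persian cognates (not taken from TajPersLexon).
gold = {"китоб": "کتاب", "об": "آب", "нон": "نان", "дӯст": "دوست"}
ranked = {
    "китоб": ["کتاب", "کتابت"],
    "об": ["آب"],
    "нон": ["نام", "نان"],  # correct answer ranked second: a top-1 miss
    "дӯст": ["دوست"],
}
accuracy = top1_accuracy(ranked, gold)
print(f"top-1 accuracy: {accuracy:.2f}")
```

Because only the first candidate is scored, a model that ranks the correct form second receives no credit, which is why exact lexical matching can penalize systems that are only approximately right.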
