Contextual Analysis for Middle Eastern Languages with Hidden Markov Models
Kazem Taghva

TL;DR
This paper introduces a machine learning approach, based on Hidden Markov Models, to the contextual analysis of Middle Eastern languages. It is demonstrated on Farsi, where the model reaches 94% accuracy, and it can be adapted to other similar languages.
Contribution
The paper presents a novel application of first-order Hidden Markov Models to language-specific contextual analysis, reducing the need to hand-code complex joining rules for each language.
Findings
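The Farsi model achieves 94% accuracy when trained on a short list of 89 Farsi words totaling 2780 characters.
The approach can be extended to Arabic, Urdu, and Sindhi.
The software can perform contextual analysis without hand-coded, language-specific rules.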
Abstract
Displaying a document in Middle Eastern languages requires contextual analysis due to different presentational forms for each character of the alphabet. The words of the document are formed by joining the correct positional glyphs representing the corresponding presentational forms of the characters. A set of rules defines the joining of the glyphs. As usual, these rules vary from language to language and are subject to interpretation by the software developers. In this paper, we propose a machine learning approach for contextual analysis based on the first-order Hidden Markov Model. We will design and build a model for the Farsi language to exhibit this technology. The Farsi model achieves 94% accuracy with training based on a short list of 89 Farsi vocabulary words consisting of 2780 Farsi characters. The experiment can be easily extended to many languages including…
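The abstract does not spell out the model's structure, so the following is only a minimal sketch of how a first-order Hidden Markov Model could be applied to this task, not the paper's implementation. It assumes the hidden states are the four positional forms (isolated, initial, medial, final), the observations are the characters of a word, the parameters are estimated by counting over hand-labeled words with add-k smoothing, and decoding uses the Viterbi algorithm. The names `FORMS`, `train`, `viterbi`, and `TOY_WORDS` are hypothetical, and the toy data uses placeholder Latin letters rather than real Farsi text.

```python
import math
from collections import Counter

# Assumed hidden states: the four positional (presentational) forms.
FORMS = ["isolated", "initial", "medial", "final"]

def train(tagged_words, k=1.0):
    """Estimate start, transition and emission log-probabilities by counting,
    with add-k smoothing. Each word is a list of (character, form) pairs."""
    start_c, trans_c, emit_c, vocab = Counter(), Counter(), Counter(), set()
    for word in tagged_words:
        prev = None
        for ch, form in word:
            vocab.add(ch)
            emit_c[form, ch] += 1
            if prev is None:
                start_c[form] += 1
            else:
                trans_c[prev, form] += 1
            prev = form
    n_words = sum(start_c.values())
    start = {f: math.log((start_c[f] + k) / (n_words + k * len(FORMS)))
             for f in FORMS}
    trans, emit = {}, {}
    for f in FORMS:
        out = sum(trans_c[f, g] for g in FORMS)
        for g in FORMS:
            trans[f, g] = math.log((trans_c[f, g] + k) / (out + k * len(FORMS)))
        total = sum(emit_c[f, c] for c in vocab)
        denom = total + k * (len(vocab) + 1)  # +1 reserves mass for unseen characters
        for c in vocab:
            emit[f, c] = math.log((emit_c[f, c] + k) / denom)
        emit[f, None] = math.log(k / denom)   # fallback for unseen characters
    return start, trans, emit

def viterbi(chars, model):
    """Return the most probable sequence of forms for a character sequence."""
    if not chars:
        return []
    start, trans, emit = model

    def e(form, ch):
        return emit.get((form, ch), emit[form, None])

    score = {f: start[f] + e(f, chars[0]) for f in FORMS}
    back = []
    for ch in chars[1:]:
        new_score, ptr = {}, {}
        for g in FORMS:
            best = max(FORMS, key=lambda f: score[f] + trans[f, g])
            new_score[g] = score[best] + trans[best, g] + e(g, ch)
            ptr[g] = best
        score = new_score
        back.append(ptr)
    path = [max(FORMS, key=lambda f: score[f])]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

# Hypothetical hand-labeled toy data (placeholder letters, not real Farsi).
TOY_WORDS = [
    [("a", "initial"), ("b", "medial"), ("c", "final")],
    [("a", "initial"), ("b", "final")],
    [("a", "isolated")],
    [("c", "initial"), ("a", "medial"), ("b", "final")],
]

model = train(TOY_WORDS)
print(viterbi(list("abc"), model))
```

In this sketch the positional constraints (for example, that a medial form is normally followed by a medial or final form) are never written down as rules; they appear only as transition probabilities learned from the labeled words, which is the sense in which such a model could replace hand-coded joining rules.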
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Text and Document Classification Technologies
