Rule based Approach for Word Normalization by resolving Transcription Ambiguity in Transliterated Search Queries
Varsha Pathak, Manish Joshi

TL;DR
This paper presents a rule-based method to normalize transliterated search queries in Indian languages, resolving transcription ambiguities to improve information retrieval accuracy.
Contribution
It introduces a novel rule-based approach incorporating Levenshtein distance for word normalization in transliterated queries, specifically for Marathi and Hindi literature search.
Findings
Improved query matching accuracy in Marathi and Hindi literature retrieval.
Effective handling of transcription noise in user queries.
Four variants of the rule set were tested, with positive results.
Abstract
Matching query terms against document terms is the basic function of any best-effort Information Retrieval model, such as the Vector Space Model. In our SMS-based Information Systems, we expect ordinary people to take part in information search. Our system lets mobile users formulate queries in their own words, with their own transliteration style and spelling. To achieve this flexibility, we resolve the term-level ambiguity caused by the inherent transcription noise in user query terms. We have developed a rule-based approach that selects the closest standard term for each noisy term in the user query. We used four versions of the rule-based algorithm, each with a different rule set. The rule set incorporates the basic Levenshtein minimum edit distance algorithm for term matching. This paper presents the experiments and corresponding…
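To illustrate the term-matching step the abstract mentions, here is a minimal Python sketch of the basic Levenshtein minimum edit distance, used to pick the closest standard term for a noisy transliterated term. It is not the authors' rule set: the abstract does not spell out the rules, so this shows only the plain edit-distance baseline. The function names (`levenshtein`, `normalize_term`) and the sample vocabulary are hypothetical and chosen for illustration.

```python
from typing import Sequence


def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    and substitutions (unit cost each) needed to turn a into b."""
    # Keep the shorter string as b so the row buffer stays small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] = distance between the empty prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def normalize_term(noisy: str, vocabulary: Sequence[str]) -> str:
    """Return the vocabulary term closest to the noisy term.
    Ties go to the earlier vocabulary entry (an assumption of this sketch)."""
    return min(vocabulary, key=lambda term: levenshtein(noisy.lower(), term.lower()))


# Hypothetical standard terms; not taken from the paper's data.
standard_terms = ["zindagi", "dosti", "mohabbat"]
print(normalize_term("jindagi", standard_terms))
print(normalize_term("mohabat", standard_terms))
```

Plain edit distance treats every character substitution alike, which is presumably why the paper layers rules on top of it: a rule-based variant could, for example, treat common transliteration alternations as cheaper than arbitrary substitutions.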
Taxonomy
Topics: Natural Language Processing Techniques · Algorithms and Data Compression · Topic Modeling
