Labeling of Query Words using Conditional Random Field
Satanu Ghosh, Souvick Ghosh, Dipankar Das

TL;DR
This paper presents a CRF-based approach for labeling query words in mixed script information retrieval, achieving an overall token-level accuracy of 75.5% on Roman-script queries that mix English with eight Indian languages.
Contribution
It introduces a novel CRF framework combined with dictionary and contextual analysis for language labeling in mixed script queries, specifically for Indian languages.
Findings
Achieved 75.5% overall accuracy in token-level language identification.
Strict F-measure scores of 0.7486, 0.892, and 0.7972 for Bengali, English, and Hindi, respectively.
System demonstrated effective language identification in mixed script queries.
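For readers less familiar with the metrics quoted above, the following is a minimal sketch of how token-level accuracy and a per-language F-measure are conventionally computed. The labels and predictions are toy data invented for illustration, the function names are hypothetical, and the standard F1 definition is assumed; the shared task's "strict" variant may be defined differently.

```python
from typing import List


def accuracy(gold: List[str], pred: List[str]) -> float:
    """Fraction of tokens whose predicted label matches the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def f_measure(gold: List[str], pred: List[str], label: str) -> float:
    """Standard F1 for one label: harmonic mean of precision and recall."""
    tp = sum(g == label and p == label for g, p in zip(gold, pred))
    fp = sum(g != label and p == label for g, p in zip(gold, pred))
    fn = sum(g == label and p != label for g, p in zip(gold, pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0.0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy example (not data from the paper): one Hindi token mislabeled as English.
gold = ["hi", "hi", "en", "bn"]
pred = ["hi", "en", "en", "bn"]

overall = accuracy(gold, pred)
per_language = {lang: f_measure(gold, pred, lang) for lang in ("bn", "en", "hi")}
```

A single mislabeled token lowers the F-measure of both languages involved: recall drops for the gold language and precision drops for the predicted one.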
Abstract
This paper describes our approach to Query Word Labeling, submitted to the shared task on Mixed Script Information Retrieval at the Forum for Information Retrieval Evaluation (FIRE) 2015. The queries were written in Roman script, and the words were either English or transliterated from Indian regional languages. A total of eight Indian languages were present in addition to English. We also identified Named Entities and special symbols as part of the task. A CRF-based machine learning framework was used to label individual words with their corresponding language labels. We used a dictionary-based approach for language identification and also took the context of each word into account. Our system demonstrated an overall accuracy of 75.5% for token-level language identification. The strict F-measure scores for the identification of token level language…
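The combination of dictionary lookups and word context described in the abstract can be illustrated with a small feature-extraction sketch. This is not the authors' feature set: the word lists, feature names, and functions below are hypothetical, chosen only to show the general shape of per-token features that a CRF toolkit could consume.

```python
from typing import Dict, List, Set

# Hypothetical toy lexicons; the paper's actual dictionaries are not reproduced here.
LEXICONS: Dict[str, Set[str]] = {
    "en": {"the", "song", "lyrics", "of"},
    "hi": {"tum", "hi", "ho", "kya"},
    "bn": {"ami", "tomake", "bhalobashi"},
}


def token_features(tokens: List[str], i: int) -> Dict[str, object]:
    """Build features for the token at position i of a Roman-script query."""
    word = tokens[i].lower()
    feats: Dict[str, object] = {
        "word": word,
        "suffix3": word[-3:],             # crude cue for transliterated morphology
        "is_alpha": word.isalpha(),       # False for symbols, digits, mixed tokens
        "is_title": tokens[i].istitle(),  # weak cue for named entities
    }
    # Dictionary features: one boolean per (assumed) language lexicon.
    for lang, lexicon in LEXICONS.items():
        feats[f"in_{lang}_dict"] = word in lexicon
    # Context features: neighbouring words, with boundary markers at the edges.
    feats["prev_word"] = tokens[i - 1].lower() if i > 0 else "<BOS>"
    feats["next_word"] = tokens[i + 1].lower() if i < len(tokens) - 1 else "<EOS>"
    return feats


def query_features(tokens: List[str]) -> List[Dict[str, object]]:
    """Feature dictionaries for every token in one query."""
    return [token_features(tokens, i) for i in range(len(tokens))]


features = query_features(["tum", "hi", "ho", "lyrics"])
```

Each query becomes a sequence of feature dictionaries paired with a sequence of language labels, which is the input shape that sequence-labeling CRF implementations generally expect. The CRF then learns label transitions on top of these features, which helps with tokens such as "hi" that are valid words in more than one language.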
Taxonomy
Topics: Natural Language Processing Techniques · Authorship Attribution and Profiling · Handwritten Text Recognition Techniques
