Vartani Spellcheck -- Automatic Context-Sensitive Spelling Correction of OCR-generated Hindi Text Using BERT and Levenshtein Distance
Aditya Pal, Abhijit Mustafi

TL;DR
Vartani Spellcheck introduces a context-sensitive Hindi spelling correction method combining BERT and Levenshtein distance, significantly improving OCR text accuracy for inflectional Indic languages.
Contribution
The paper presents a novel context-aware Hindi spelling correction approach using BERT and Levenshtein distance, outperforming previous context-free models.
Findings
Achieved 81% correction accuracy on OCR-generated Hindi text.
Significant improvement over previous context-sensitive correction methods.
Demonstrated potential for real-time autocorrect in text editors.
Abstract
Traditional Optical Character Recognition (OCR) systems that generate text of highly inflectional Indic languages like Hindi tend to suffer from poor accuracy due to a wide alphabet set, compound characters and difficulty in segmenting characters in a word. Automatic spelling error detection and context-sensitive error correction can be used to improve accuracy by post-processing the text generated by these OCR systems. A majority of previously developed language models for error correction of Hindi spelling have been context-free. In this paper, we present Vartani Spellcheck - a context-sensitive approach for spelling correction of Hindi text using a state-of-the-art transformer - BERT in conjunction with the Levenshtein distance algorithm, popularly known as Edit Distance. We use a lookup dictionary and context-based named entity recognition (NER) for detection of possible spelling…
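As a rough illustration of how a dictionary lookup, context-based candidates, and Levenshtein distance could fit together, here is a minimal sketch in Python. It is not the authors' implementation: the abstract is truncated, so the exact pipeline is an assumption. The function names `levenshtein` and `pick_correction`, the `context_candidates` list (standing in for words that a masked language model such as BERT might propose for the error position), and the toy dictionary are all hypothetical. Distances are computed over Unicode code points, so a Devanagari consonant and its vowel sign count as separate characters.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit cost for insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def pick_correction(ocr_word, context_candidates, dictionary):
    """Hypothetical helper: choose the candidate closest to the OCR output.

    A word found in the dictionary is treated as correct. Otherwise the
    in-dictionary candidate with the smallest edit distance wins; ties keep
    the earlier candidate, i.e. the one the context model ranked higher.
    """
    if ocr_word in dictionary:
        return ocr_word
    in_vocab = [w for w in context_candidates if w in dictionary]
    if not in_vocab:
        return ocr_word  # nothing usable, leave the word unchanged
    return min(in_vocab, key=lambda w: levenshtein(ocr_word, w))


# Toy data, invented for illustration only.
dictionary = {"किताब", "कहानी", "कविता"}
candidates = ["कहानी", "किताब", "कविता"]  # assumed output of a context model
suggestion = pick_correction("किताव", candidates, dictionary)
```

In this toy case the OCR output "किताव" is not in the dictionary, and "किताब" is the candidate at the smallest edit distance, so it is returned as the suggestion.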
Taxonomy
Topics: Natural Language Processing Techniques · Speech Recognition and Synthesis · Topic Modeling
Methods: Linear Layer · Linear Warmup With Linear Decay · Attention Is All You Need · Layer Normalization · Dropout · Weight Decay · Dense Connections · Adam · Attention Dropout
