Normalizing Text using Language Modelling based on Phonetics and String Similarity
Fenil Doshi, Jimit Gandhi, Deep Gosalia, Sudhir Bagul

TL;DR
This paper introduces a robust text normalization method using BERT with novel masking strategies based on phonetic and string similarity, effectively handling informal and adversarial text variations for improved language processing tasks.
Contribution
The paper presents a new BERT-based normalization approach with two innovative masking strategies leveraging phonetic and string similarity metrics, enhancing robustness against informal and adversarial text.
Findings
Achieved accuracies of 86.7% and 83.2% with the two masking strategies, as judged in human-centric evaluations
Effective in handling informal and adversarial spelling variations
Improves downstream language processing tasks
Abstract
Social media networks and chatting platforms often use an informal version of natural text. Adversarial spelling attacks also alter the input text by modifying its characters. Normalizing these texts is an essential step for applications such as language translation and text-to-speech synthesis, where the models are trained on clean, regular English. We propose a new robust model to perform text normalization. Our system uses the BERT language model to predict the masked words that correspond to the unnormalized words. We propose two unique masking strategies that try to replace the unnormalized words in the text with their root form using a unique score based on phonetic and string similarity metrics. We use human-centric evaluations in which volunteers were asked to rank the normalized text. Our strategies yield an accuracy of 86.7% and 83.2% which…
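The abstract does not spell out the scoring formula, so the following is only a minimal sketch of how a combined phonetic and string similarity score could rank replacement candidates for an unnormalized word. Everything here is an assumption for illustration: a simplified Soundex code stands in for the phonetic metric, `difflib.SequenceMatcher` stands in for the string metric, the weight `alpha` and the function names are hypothetical, and the candidate list is hard-coded where the described system would take candidates from BERT's predictions for the masked position.

```python
from difflib import SequenceMatcher

# Standard Soundex digit classes (vowels, h, w, y carry no digit).
_SOUNDEX_CODES = {
    **dict.fromkeys("bfpv", "1"),
    **dict.fromkeys("cgjkqsxz", "2"),
    **dict.fromkeys("dt", "3"),
    "l": "4",
    **dict.fromkeys("mn", "5"),
    "r": "6",
}


def soundex(word: str) -> str:
    """Simplified Soundex: first letter plus up to three digits, zero-padded."""
    letters = [c for c in word.lower() if c.isalpha()]
    if not letters:
        return ""
    code = letters[0].upper()
    prev = _SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        digit = _SOUNDEX_CODES.get(c, "")
        if digit and digit != prev:
            code += digit
        if c not in "hw":  # h and w do not separate letters with equal codes
            prev = digit
    return (code + "000")[:4]


def string_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def phonetic_similarity(a: str, b: str) -> float:
    """Similarity of the two Soundex codes in [0, 1]."""
    return SequenceMatcher(None, soundex(a), soundex(b)).ratio()


def combined_score(noisy: str, candidate: str, alpha: float = 0.5) -> float:
    """Weighted mix of phonetic and string similarity (alpha is hypothetical)."""
    return (alpha * phonetic_similarity(noisy, candidate)
            + (1 - alpha) * string_similarity(noisy, candidate))


def best_candidate(noisy: str, candidates: list[str], alpha: float = 0.5) -> str:
    """Pick the candidate that scores highest against the noisy token."""
    return max(candidates, key=lambda c: combined_score(noisy, c, alpha))


if __name__ == "__main__":
    # Hypothetical candidates, standing in for masked-word predictions.
    candidates = ["tomorrow", "today", "time"]
    print(best_candidate("tmrw", candidates))
```

Soundex drops digits and non-letters, so tokens such as "2moro" would need a separate substitution step before scoring; that limitation belongs to this sketch, not necessarily to the paper's method.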
Taxonomy
Topics: Natural Language Processing Techniques · Topic Modeling · Speech Recognition and Synthesis
Methods: Linear Layer · Weight Decay · Softmax · Adam · Multi-Head Attention · Dropout · Attention Dropout · Linear Warmup With Linear Decay · Dense Connections
