Cross-Lingual Text Classification of Transliterated Hindi and Malayalam
Jitin Krishnan, Antonios Anastasopoulos, Hemant Purohit, Huzefa Rangwala

TL;DR
This paper addresses the challenge of classifying transliterated Hindi and Malayalam text on social media by combining data augmentation with Teacher-Student training, improving multilingual models' performance on real-world transliterated datasets.
Contribution
It introduces a novel approach integrating data augmentation and Teacher-Student training for cross-lingual transfer in transliterated NLP tasks, with new datasets for benchmarking.
Findings
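Average F1 score improvement of +5.6% on mBERT over a strong baseline
Average F1 score improvement of +4.7% on XLM-R over a strong baseline
Evaluated on transliterated Hindi and Malayalam social media text, including new datasets for sentiment classification and crisis tweet classification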
Abstract
Transliteration is very common on social media, but transliterated text is not adequately handled by modern neural models for various NLP tasks. In this work, we combine data augmentation approaches with a Teacher-Student training scheme to address this issue in a cross-lingual transfer setting for fine-tuning state-of-the-art pre-trained multilingual language models such as mBERT and XLM-R. We evaluate our method on transliterated Hindi and Malayalam, also introducing new datasets for benchmarking on real-world scenarios: one on sentiment classification in transliterated Malayalam, and another on crisis tweet classification in transliterated Hindi and Malayalam (related to the 2013 North India and 2018 Kerala floods). Our method yielded an average improvement of +5.6% on mBERT and +4.7% on XLM-R in F1 scores over their strong baselines.
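The abstract does not spell out the training objective, so the following is only a generic sketch of how a Teacher-Student scheme can be paired with data augmentation: a teacher scores a source sentence, a student scores a transliterated (augmented) variant of the same sentence, and the student is penalized for diverging from the teacher's predicted distribution. The function names, the temperature parameter, and the example logits are all hypothetical and are not taken from the paper.

```python
import math


def softmax(logits, temperature=1.0):
    """Convert raw class scores to probabilities (numerically stable)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=1.0):
    """KL(teacher || student) for a single example.

    Assumption: the teacher saw the source sentence and the student saw
    a transliterated variant of it. The loss is 0 when both models
    predict the same class distribution and grows as they diverge.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical logits for a 3-class task (illustrative values only).
teacher_logits = [2.0, 0.5, -1.0]   # teacher on the source sentence
student_logits = [1.5, 0.7, -0.8]   # student on the transliterated variant

loss = distillation_loss(teacher_logits, student_logits)
print(f"distillation loss: {loss:.4f}")
```

In a real setup the logits would come from fine-tuned multilingual models such as mBERT or XLM-R, and the loss would be averaged over a batch and minimized with respect to the student's parameters only.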
Taxonomy
Topics: Natural Language Processing Techniques · Text Readability and Simplification · Topic Modeling
Methods: XLM-R · mBERT
