All that is English may be Hindi: Enhancing language identification through automatic ranking of likeliness of word borrowing in social media
Jasabanta Patro, Bidisha Samanta, Saurabh Singh, Abhipsa Basu, Prithwish Mukherjee, Monojit Choudhury, Animesh Mukherjee

TL;DR
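This paper introduces computational methods that use social media signals to estimate how likely a word is to be borrowed. The estimates reach more than twice the Spearman correlation of the best prior baseline, and annotators using them judged that most foreign-tagged words in predominantly native contexts should be tagged as native, which points to large room for improvement in language identification.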
Contribution
The paper presents methods for estimating the borrowing likeliness of a word from social media signals, achieving a Spearman correlation more than twice that of the best-performing baseline reported in the literature.
Findings
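The methods reach a Spearman correlation of nearly 0.62, compared with nearly 0.26 for the best-performing baseline.
In 88% of cases, annotators judged that the foreign language tag of a word in a predominantly native context should be replaced by the native language tag.
This indicates substantial scope for improving automatic language identification systems.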
Abstract
In this paper, we present a set of computational methods to identify the likeliness of a word being borrowed, based on signals from social media. In terms of Spearman correlation coefficient values, our methods perform more than two times better (nearly 0.62) in predicting the borrowing likeliness compared to the best-performing baseline (nearly 0.26) reported in the literature. Based on this likeliness estimate, we asked annotators to re-annotate the language tags of foreign words in predominantly native contexts. In 88 percent of cases, the annotators felt that the foreign language tag should be replaced by the native language tag, thus indicating a huge scope for improvement of automatic language identification systems.
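The abstract evaluates each method by the Spearman correlation between its borrowing-likeliness scores and a reference ranking. The sketch below shows how that statistic can be computed from two score lists. It is a minimal illustration, not the authors' evaluation code: the function names `average_ranks` and `spearman` and all the numbers are made up for the example and do not come from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j over the run of values tied with the one at position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: the Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mean_x, mean_y = sum(rx) / n, sum(ry) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(rx, ry))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in rx))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in ry))
    return cov / (std_x * std_y)


# Toy data (hypothetical): one score per candidate word from some metric,
# and a reference score for the same five words in the same order.
metric_scores = [0.9, 0.7, 0.4, 0.8, 0.1]
reference_scores = [4, 3, 2, 5, 1]

rho = spearman(metric_scores, reference_scores)
print(f"rho = {rho:.2f}")  # prints "rho = 0.90"
```

Because only the rank order matters, any monotonic rescaling of a method's scores leaves the correlation unchanged, which makes the statistic suitable for comparing methods whose scores are on different scales.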
