TL;DR
This paper investigates how much gender and language information can be inferred from a username alone. It uses unsupervised morphology induction to split usernames into sub-units and shows that the resulting morphological features outperform a character n-gram baseline.
Contribution
It introduces a method that uses unsupervised morphology induction to extract features from usernames for demographic inference, and reports that these features are more effective than a character n-gram baseline on both tasks (gender and language).
Findings
Morphological features outperform a character n-gram baseline
Gender and language can be inferred from usernames with reasonable accuracy
Unsupervised morphology induction effectively captures meaningful sub-units in usernames
Abstract
Usernames are ubiquitous on the Internet, and they are often suggestive of user demographics. This work looks at the degree to which gender and language can be inferred from a username alone by making use of unsupervised morphology induction to decompose usernames into sub-units. Experimental results on the two tasks demonstrate the effectiveness of the proposed morphological features compared to a character n-gram baseline.
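For readers unfamiliar with the baseline mentioned above, the following is a minimal sketch of character n-gram feature extraction for a single username. It is illustrative only and not the authors' implementation: the function name char_ngrams, the boundary markers ^ and $, the lowercasing step, and the default n-gram range are all assumptions made for this example.

```python
from collections import Counter


def char_ngrams(username, n_min=1, n_max=3):
    """Count character n-grams in a username (hypothetical baseline features).

    Assumptions for this sketch: the username is lowercased, and "^" / "$"
    are added as start/end markers so that prefixes and suffixes produce
    distinct features.
    """
    padded = "^" + username.lower() + "$"
    counts = Counter()
    for n in range(n_min, n_max + 1):
        # Slide a window of width n across the padded string.
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


if __name__ == "__main__":
    # Bigram features for a short example username.
    print(char_ngrams("Ann", n_min=2, n_max=2))
```

Features of this kind treat every fixed-width window equally, whereas morphological features would instead come from sub-units proposed by an unsupervised segmenter. The resulting counts could be fed to any standard classifier.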
