TL;DR
This paper introduces a comprehensive multilingual framework for detecting reclaimed slurs in social media, combining data augmentation, transfer learning, and threshold optimization to improve accuracy across languages.
Contribution
It develops a novel multi-stage approach integrating augmentation, transfer learning, and threshold tuning, with systematic evaluation of multilingual embedding models.
Findings
XLM-RoBERTa was selected as the foundation model from eight multilingual embedding models, based on macro-averaged F1 score.
Back-translation tripled training data while maintaining semantics.
Threshold optimization improved F1 scores by 2-5%.
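Threshold optimization of this kind can be sketched as follows. This is a minimal illustration, not the paper's implementation. It assumes a binary classifier (reclamatory vs. non-reclamatory) that outputs a probability for the positive class, and it assumes the decision threshold is chosen by grid search on a held-out validation set to maximize macro-averaged F1. The function names, the grid, and the toy data are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 for one class from its confusion counts (0.0 if undefined)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over the two classes 0 and 1."""
    scores = []
    for cls in (0, 1):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
        scores.append(f1(tp, fp, fn))
    return sum(scores) / len(scores)


def best_threshold(y_true, probs, grid=None):
    """Grid-search the cutoff on validation data (assumed procedure)."""
    if grid is None:
        grid = [i / 100 for i in range(5, 96)]  # 0.05 .. 0.95, hypothetical grid
    best_t, best_score = 0.5, -1.0
    for t in grid:
        preds = [1 if p >= t else 0 for p in probs]
        score = macro_f1(y_true, preds)
        if score > best_score:  # keep the first threshold that reaches the best score
            best_t, best_score = t, score
    return best_t, best_score


# Toy validation set: the positive class is rare and scored below 0.5.
val_labels = [0, 0, 0, 0, 1, 1]
val_probs = [0.10, 0.20, 0.30, 0.35, 0.40, 0.45]
threshold, score = best_threshold(val_labels, val_probs)
```

On this toy data the default 0.5 cutoff predicts only the majority class, while a lower cutoff separates the two classes. The selected threshold would then be frozen and applied to the test set.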
Abstract
This paper presents a multi-stage framework for detecting reclaimed slurs in multilingual social media discourse. It addresses the challenge of distinguishing reclamatory from non-reclamatory usage of LGBTQ+-related slurs across English, Spanish, and Italian tweets. The framework handles three intertwined methodological challenges: data scarcity, class imbalance, and cross-linguistic variation in sentiment expression. It integrates data-driven model selection via cross-validation, semantic-preserving augmentation through back-translation, inductive transfer learning with dynamic epoch-level undersampling, and domain-specific knowledge injection via masked language modeling. Eight multilingual embedding models were evaluated systematically, with XLM-RoBERTa selected as the foundation model based on macro-averaged F1 score. Data augmentation via GPT-4o-mini back-translation to…
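The dynamic epoch-level undersampling mentioned in the abstract can be illustrated with a short sketch. The abstract does not give the sampling ratio or procedure, so this assumes the simplest variant: at the start of each epoch, every class is randomly cut down to the size of the smallest class, with a fresh draw per epoch so that different majority-class examples are seen over the course of training. The function name, the seeding scheme, and the toy data are hypothetical.

```python
import random


def epoch_undersample(examples, labels, epoch, base_seed=0):
    """Return a class-balanced subset for one epoch (assumed 1:1 ratio)."""
    rng = random.Random(base_seed + epoch)  # a different draw for each epoch
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    n_min = min(len(idxs) for idxs in by_class.values())
    chosen = []
    for idxs in by_class.values():
        # Minority classes are kept whole; larger classes are subsampled.
        chosen.extend(rng.sample(idxs, n_min))
    rng.shuffle(chosen)
    return [examples[i] for i in chosen], [labels[i] for i in chosen]


# Toy data: 8 majority-class (0) examples and 2 minority-class (1) examples.
texts = [f"tweet_{i}" for i in range(10)]
labels = [0] * 8 + [1] * 2

for epoch in range(3):
    x_epoch, y_epoch = epoch_undersample(texts, labels, epoch)
    # A training step over (x_epoch, y_epoch) would go here.
```

Resampling per epoch, rather than once before training, means the discarded majority-class examples change between epochs, so less of the data is thrown away overall.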
