TL;DR
This paper explores semi-supervised learning and Dice Loss to improve toxic span detection in texts, addressing data scarcity and class imbalance, and demonstrates their effectiveness through an ensemble approach.
Contribution
It introduces the use of semi-supervised learning and Self-Adjusting Dice Loss for toxic span detection, a novel combination for this task.
Findings
The submitted system ranked ninth on the SemEval-2021 Task 5 leaderboard.
Ensemble of Transformer models improved detection accuracy.
Techniques effectively addressed data scarcity and class imbalance.
Abstract
In this work, we present our approach and findings for SemEval-2021 Task 5: Toxic Spans Detection. The task's main aim was to identify the spans to which a given text's toxicity could be attributed. The task is challenging mainly due to two constraints: the small training dataset and the imbalanced class distribution. Our paper investigates two techniques, semi-supervised learning and learning with Self-Adjusting Dice Loss, for tackling these challenges. Our submitted system (ranked ninth on the leaderboard) consisted of an ensemble of various pre-trained Transformer Language Models, each trained using one of the two techniques above.
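The abstract names Self-Adjusting Dice Loss without stating its form. The sketch below shows one commonly used binary, per-token formulation, in which the predicted probability p is scaled by (1 - p)^alpha before the Dice coefficient is computed. This is an illustration under assumptions, not the authors' implementation: the function name, the `alpha` and `gamma` parameters, and their default values are all hypothetical choices made here.

```python
from typing import Sequence


def self_adjusting_dice_loss(
    probs: Sequence[float],
    labels: Sequence[int],
    alpha: float = 1.0,   # assumed default; controls how strongly confident tokens are down-weighted
    gamma: float = 1.0,   # assumed default; smoothing term in numerator and denominator
) -> float:
    """Mean of (1 - DSC) over tokens for a binary toxic / non-toxic labelling.

    probs[i]  is the predicted probability that token i is toxic.
    labels[i] is 1 if token i is toxic, else 0.
    """
    if len(probs) != len(labels):
        raise ValueError("probs and labels must have the same length")
    if not probs:
        raise ValueError("at least one token is required")

    total = 0.0
    for p, y in zip(probs, labels):
        # Scale p by (1 - p)^alpha so tokens predicted with high confidence contribute less.
        weighted = ((1.0 - p) ** alpha) * p
        # Smoothed per-token Dice coefficient.
        dsc = (2.0 * weighted * y + gamma) / (weighted + y + gamma)
        total += 1.0 - dsc
    return total / len(probs)


if __name__ == "__main__":
    # Toy sequence: five tokens, only the third is toxic.
    print(self_adjusting_dice_loss([0.1, 0.2, 0.7, 0.1, 0.05], [0, 0, 1, 0, 0]))
```

Because each token's term depends on the overlap between prediction and label rather than on a per-token log-likelihood, a loss of this shape is often chosen when the positive class (here, toxic tokens) is rare.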
Taxonomy
Methods: Linear Layer · Dice Loss · Absolute Position Encodings · Position-Wise Feed-Forward Layer · Softmax · Dropout · Attention Is All You Need · Byte Pair Encoding · Residual Connection · Layer Normalization
