Comparing Performance of Different Linguistically-Backed Word Embeddings for Cyberbullying Detection
Juuso Eronen, Michal Ptaszynski, Fumito Masui

TL;DR
This paper explores the integration of linguistic features like morphology and syntax into word embeddings to improve their capacity for complex tasks such as cyberbullying detection.
Contribution
It introduces a method to combine linguistic information with raw tokens in embeddings, potentially enhancing language model performance on nuanced tasks.
Findings
Linguistically enriched embeddings capture deeper lexical relations.
Enhanced embeddings improve cyberbullying detection accuracy.
Method can be applied to pre-training large language models.
Abstract
In most cases, word embeddings are learned only from raw tokens or, in some cases, lemmas. This includes pre-trained language models like BERT. To investigate the potential of capturing deeper relations between lexical items and structures and to filter out redundant information, we propose to preserve the morphological, syntactic and other types of linguistic information by combining them with the raw tokens or lemmas. This means, for example, including part-of-speech or dependency information within the lexical features used. The word embeddings can then be trained on the combinations instead of just raw tokens. It is also possible to later apply this method to the pre-training of huge language models and possibly enhance their performance. This would aid in tackling problems which are more sophisticated from the point of view of linguistic representation, such as detection of…
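The combination step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `combine_features`, the underscore separator, the hand-assigned Universal POS style tags, and the example sentence are all assumptions made for illustration. Each token (or lemma) is joined with its part-of-speech tag to form a single lexical feature.

```python
from typing import List, Tuple

# Each item is a (token, lemma, pos) triple. In practice these would come
# from a tagger; here they are written by hand (an assumption).
Tagged = Tuple[str, str, str]


def combine_features(tagged: List[Tagged], use_lemma: bool = False,
                     sep: str = "_") -> List[str]:
    """Join each token (or lemma) with its POS tag into one feature string.

    The separator and the feature layout are illustrative choices,
    not taken from the paper.
    """
    features = []
    for token, lemma, pos in tagged:
        base = lemma if use_lemma else token
        features.append(f"{base}{sep}{pos}")
    return features


# Hypothetical example sentence with hand-assigned tags.
sentence: List[Tagged] = [
    ("They", "they", "PRON"),
    ("posted", "post", "VERB"),
    ("comments", "comment", "NOUN"),
]

token_pos = combine_features(sentence)                  # token + POS
lemma_pos = combine_features(sentence, use_lemma=True)  # lemma + POS
print(token_pos)
print(lemma_pos)
```

Sequences such as `token_pos` could then be passed to an embedding trainer in place of the raw token sequences, so that each combined feature receives its own vector.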
Taxonomy
Topics: Hate Speech and Cyberbullying Detection · Software Engineering Research
Methods: Multi-Head Attention · Attention Is All You Need · Linear Layer · Weight Decay · Softmax · Layer Normalization · Attention Dropout · WordPiece · Adam
