Specializing General-purpose LLM Embeddings for Implicit Hate Speech Detection across Datasets

Vassiliy Cheremetiev; Quang Long Ho Ngo; Chau Ying Kot; Alina Elena Baia; Andrea Cavallaro

arXiv:2508.20750·cs.CL·August 29, 2025


TL;DR

This paper demonstrates that fine-tuning general-purpose LLM embeddings can significantly improve implicit hate speech detection across various datasets, achieving state-of-the-art results without external knowledge integration.

Contribution

The study shows that simple fine-tuning of large language model embeddings outperforms existing methods in implicit hate speech detection across multiple datasets.

Findings

01. Up to 1.10 percentage points improvement in in-dataset F1-macro score.

02. Up to 20.35 percentage points improvement in cross-dataset F1-macro score.

03. State-of-the-art performance achieved through fine-tuning LLM embeddings.

Abstract

Implicit hate speech (IHS) is indirect language that conveys prejudice or hatred through subtle cues, sarcasm or coded terminology. IHS is challenging to detect as it does not include explicit derogatory or inflammatory words. To address this challenge, task-specific pipelines can be complemented with external knowledge or additional information such as context, emotions and sentiment data. In this paper, we show that, by solely fine-tuning recent general-purpose embedding models based on large language models (LLMs), such as Stella, Jasper, NV-Embed and E5, we achieve state-of-the-art performance. Experiments on multiple IHS datasets show up to 1.10 percentage points improvements for in-dataset, and up to 20.35 percentage points improvements in cross-dataset evaluation, in terms of F1-macro score.

Tables

Table 1. Backbone models used for IHS detection. Multiple models indicate variations in the original work.

Backbone model: Related work

BERT: ImpCon (Kim et al., 2022), LAHN (Kim et al., 2024), SharedCon (Ahn et al., 2024), CCL (Jiang, 2025), ConPrompt (Kim et al., 2023), MTL (Mnassri et al., 2023), AngryBERT (Awal et al., 2021), FiADD (Masud et al., 2024), EHSor (Min et al., 2023), Contrastive BERT (Ocampo et al., 2023)

RoBERTa: ImpCon (Kim et al., 2022), LAHN (Kim et al., 2024)

HateBERT: ImpCon (Kim et al., 2022), CCL (Jiang, 2025), FiADD (Masud et al., 2024), Contrastive HateBERT (Ocampo et al., 2023)

mBERT: MTL (Mnassri et al., 2023)

Table 2. Distribution of labels in the datasets.


Table 3. Results on IHC (ElSherief et al., 2021), SBIC (Sap et al., 2020), DynaHate (Vidgen et al., 2021) and ToxiGen (Hartvigsen et al., 2022) datasets for binary classification with hate as the positive class. We report the average over 5 runs with different seeds; the standard deviation for each metric is in parentheses. Models E5, Stella, Jasper and NV-Embed only use the tweet. Best result for each dataset/metric combination is in bold. Key: Acc: unweighted accuracy, P: precision, R: recall, F1-w: weighted F1-score, F1-M: macro F1-score, C: context features, E: emotion features, +: concatenation, AF: adaptive fusion, MoE: simple mixture of experts, SLQ: shared learnable query.


Table 4. In-dataset and cross-dataset results for different models trained on IHC (ElSherief et al., 2021) and SBIC (Sap et al., 2020) for binary classification with hate as the positive class. We report the average performance across 5 seeds with standard deviation. Models E5, Stella, Jasper and NV-Embed only use the tweet. * indicates results taken from their corresponding papers. † indicates results taken from related works referencing the method. - indicates results not available in the corresponding papers. For ImpCon (Kim et al., 2022), SharedCon (Ahn et al., 2024) and CCL (Jiang, 2025), we added an extra zero to the results to maintain consistency with other studies that report metrics using two decimal precision. Key: Acc: unweighted accuracy, F1-M: macro F1-score, FT: fine-tuning, LP: linear probing, B: BERT backbone, HB: HateBERT backbone, RB: RoBERTa backbone.


Table 5. Words representing the most common topics in the test set of the Implicit Hate Corpus dataset (some synonyms/repeated variations of words were removed manually from the representations to enhance readability). Shown words are the most frequently occurring words within topic clusters; the group of most representative words forms the representation of the topic. The Count column shows the number of tweets per topic. Words may overlap across topics: for example, the first and second topics show racial terms, but differ in focus (general hostility versus a political context).


Table 6. Examples of tweets from IHC (ElSherief et al., 2021) and their corresponding context generated by Llama 2.


Table 7. Results with linear probing on IHC (ElSherief et al., 2021), SBIC (Sap et al., 2020), DynaHate (Vidgen et al., 2021) and ToxiGen (Hartvigsen et al., 2022) datasets for binary classification with hate as the positive class. We report the average over 5 runs with different seeds, and the standard deviation for each metric is in parentheses. Models E5, Stella, Jasper and NV-Embed only use the tweet. Key: Acc: unweighted accuracy, P: precision, R: recall, F1-w: weighted F1-score, F1-M: macro F1-score, LP: linear probing.


Table 8. Mean inference speed and GPU memory usage. These numbers are obtained using the specified testing configuration.


Table 9. Samples from IHC (ElSherief et al., 2021) classified with high confidence by BERTweet with context and emotion.

- it ’s not fair for black americans to go jobless well you

import illegals to replace them !

- it didn’t take long to figure out jews were / are behind

white-guilt concept that ’s ravaging western-societies-

really ? so the jews are masterminding immigration

policies in order to wipe out white people ?

- in other words ”we need to keep brainwahsing whites

to accept their destruction.” #istandfordiversity

Table 10. Samples from IHC (ElSherief et al., 2021) classified with high confidence by NV-Embed.

Prediction

Correct label

hate

not hate

- no one cares, more illegals making illegal entry into our nation. like any criminal, some got theirs. deport the rest.

- round them up & deport them.

- yep, the deal is, they get deported and so do their illegal parents. then we build the wall so they never come back. that’s my dream. does that make me a dreamer?

- it’s not fair for black americans to go jobless well you import illegals to replace them !

- it didn’t take long to figure out jews were / are behind white-guilt concept that’s ravaging western-societies

- blacks & latinos attack & kill white people daily but when blacks or latinos attack & kill cops it’s a big deal? just sayin

not hate

- #macon ga : macon’s mlk drive ebt marts are wrapped in anti-theft caging. and tacky yellow anti-theft cages at that.

- cuckservatives : yes the alt-right are just a bunch of racists

- he is ranting because the alt-lite has collapsed . the alt-right is being proven right about nationalism

- shri ajay tamta wins in almora

- more : the russian bombers will reportedly launch from the ’engels’ airbase and will be armed with cruise missiles .

- this piece seems to conflate 2 positions . i believe royce will lead hhs faith-based office but not overall administration faith-based office


Full text

Specializing General-purpose LLM Embeddings for Implicit Hate Speech Detection across Datasets

Vassiliy Cheremetiev (ORCID 0009-0003-3830-4536), EPFL, Lausanne, Switzerland; Idiap Research Institute, Martigny, Switzerland

Quang Long Ho Ngo (ORCID 0009-0009-2918-3385), EPFL, Lausanne, Switzerland; Idiap Research Institute, Martigny, Switzerland

Chau Ying Kot (ORCID 0009-0009-2306-8722), EPFL, Lausanne, Switzerland

Alina Elena Baia (ORCID 0000-0001-5553-776X), Idiap Research Institute, Martigny, Switzerland

Andrea Cavallaro (ORCID 0000-0001-5086-7858), EPFL, Lausanne, Switzerland; Idiap Research Institute, Martigny, Switzerland

(2025)


Content warning: This paper discusses examples of harmful text that may be offensive or upsetting.

Keywords: implicit hate speech, detection, context, embeddings

Published in: Proceedings of the 2nd International Workshop on Diffusion of Harmful Content on Online Web (DHOW ’25), October 27–28, 2025, Dublin, Ireland. DOI: 10.1145/3746275.3762209. ISBN: 979-8-4007-2057-4/2025/10.

CCS concepts: Computing methodologies: Natural language processing; Human-centered computing: Social media; Human-centered computing: Social content sharing; Social and professional topics: Hate speech.

1. Introduction

Hate speech detection is important to support content moderation in digital platforms, to foster inclusive discourse and to prevent social harm.

Hate speech can be explicit or implicit. Explicit hate speech (EHS) directly targets a protected entity and contains explicit keywords. Hence, early efforts for EHS detection primarily focused on identifying explicitly abusive language through keyword-based approaches (Waseem and Hovy, 2016; Davidson et al., 2017; Schmidt and Wiegand, 2017). Implicit hate speech (IHS) is “the use of coded or indirect language such as sarcasm, metaphor, and circumlocution to disparage a protected group or individual, or to convey prejudicial and harmful views about them” (Gao et al., 2017; Waseem et al., 2017; ElSherief et al., 2021). IHS has a nuanced nature and manifests through a diverse range of subtle forms such as stereotypes, humor, and sarcasm (Sap et al., 2020; Davidson et al., 2019; Founta et al., 2018; Waseem et al., 2017; Jurgens et al., 2019; Qian et al., 2019). Although IHS may not contain explicit hate words, it propagates prejudice and discrimination, and it is as harmful as its explicit counterpart (Basile et al., 2019; Mozafari et al., 2020b). Even humans may struggle to understand the underlying meaning and intent behind such expressions (Sap et al., 2020; Hartvigsen et al., 2022).

Detecting IHS is made difficult by its lexical and semantic similarity to non-hateful content. IHS detection requires a nuanced understanding of implied meaning (Masud et al., 2024), real-world knowledge related to an event, specific social contexts, and the target.

LLMs capture and represent extensive world knowledge (Yu et al., 2024), which could be leveraged for hate speech detection. Prior works explored prompting LLMs in scenarios like zero-shot (Huang et al., 2023; Yang et al., 2023; Li et al., 2024; Zhu et al., 2023; Damo et al., 2024), zero-shot with chain-of-thought (Yang et al., 2023), and few-shot in-context learning (Zhang et al., 2024a). However, LLMs incorporate safeguards that can prevent them from answering or discussing sensitive topics such as hateful content. Moreover, LLMs may exhibit limitations such as an excessive focus on sensitive groups, which leads to benign speech being misclassified as hate, or extreme confidence score distributions, which lead to poor calibration (Zhang et al., 2024a). Overall, these models (e.g., GPT-3.5-Turbo, LLaMa2-7B, Mixtral-8x7b) typically underperform task-specific fine-tuned models (Yang et al., 2023; Zhang et al., 2024a; Damo et al., 2024).

In this paper, we evaluate fusing multiple sources of information to enhance BERT-based classifiers and leverage the ability of LLMs to generate contextual information for IHS detection. Specifically, we explore four fusion strategies to complement content information with contextual and emotion information. We find that while information fusion via feature concatenation provides a slight improvement over content-only BERT-based classifiers, fine-tuning general-purpose LLM-based embeddings (e.g., Stella (Zhang et al., 2025), Jasper (Zhang et al., 2025), NV-Embed (Lee et al., 2025), E5 (Wang et al., 2024b)) allows us to reach new state-of-the-art performance for IHS detection. In summary, our main contributions are as follows:

• We present a comprehensive comparative evaluation of BERT-based and recent embedding-based classifiers, and show that fusion with LLM-generated context and emotion information can only marginally enhance the performance of a BERT-based classifier. We introduce new state-of-the-art benchmarks in this category of classifiers based on fine-tuning of generalist embedding models.

• We show that specializing embedding-based models significantly improves IHS detection in cross-dataset settings. This approach outperforms current state-of-the-art methods (Kim et al., 2024; Jiang, 2025; Kim et al., 2023; Yang et al., 2023) on several IHS datasets, by up to 1.10 percentage points for in-dataset evaluation and by up to 20.35 percentage points for cross-dataset evaluation (F1-macro score). The large improvement in cross-dataset evaluation is particularly noteworthy because it indicates better generalization across datasets.

Our approach is significant because it simplifies the detection process and eliminates the need for (explicit) external knowledge. To the best of our knowledge, we are the first to use general-purpose LLM-based embedding models for IHS detection. The code is available at https://github.com/idiap/implicit-hsd.
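The F1-macro score in which the gains above are reported is the unweighted mean of the per-class F1 scores, so the minority hate class counts as much as the majority class. The following is a minimal sketch of that standard definition in plain Python; the function name and the toy labels are illustrative assumptions and are not taken from the released code.

```python
def f1_macro(y_true, y_pred, classes=(0, 1)):
    """Unweighted mean of per-class F1 scores (1 = hate, 0 = not hate)."""
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy example (made-up labels): one hate sample is missed by the classifier.
toy_true = [1, 1, 0, 0]
toy_pred = [1, 0, 0, 0]
toy_score = f1_macro(toy_true, toy_pred)
```

A difference between two such scores expressed on a 0 to 100 scale is what the text calls an improvement in percentage points.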

2. Related Work

Early research in hate speech detection primarily focused on identifying explicit abusive language through linguistic features, such as character n-grams (Waseem and Hovy, 2016) or word-centered features (i.e., literal words, part-of-speech tagging, occurrence of words within a word window) (Warner and Hirschberg, 2012). A combination of features such as TF-IDF weighted n-grams, part-of-speech tags, metadata including indicators for elements like hashtags and URLs, and number of characters and words was also used to train classifiers (Davidson et al., 2017; Schmidt and Wiegand, 2017). In (Del Vigna et al., 2017), the authors explore the combination of lexical and syntactic features with word sentiments and word embeddings. These models rely on phrase structure and fail to capture the complexity and subtlety of the language used in social media. Transformer-based models have improved the quality of classification (Mozafari et al., 2020a; Saleh et al., 2023). Later works (Sap et al., 2020; Davidson et al., 2019; Founta et al., 2018; Waseem et al., 2017; Jurgens et al., 2019; Qian et al., 2019) have emphasized the nuanced nature and complexity of implicit hate. Progress has been made in this area by focusing on specific types of implicit hate, such as euphemistic hate speech (Magu and Luo, 2018), sarcasm detection (Abu Farha et al., 2022), as well as through multi-task learning (Min et al., 2023; Plaza-Del-Arco et al., 2021; Awal et al., 2021; Mnassri et al., 2023; Jafari et al., 2023), external knowledge integration (Lin, 2022; Sridhar and Yang, 2022; Kim et al., 2023; Yang et al., 2023; Pérez et al., 2023) or contrastive learning-based methods (Ahn et al., 2024; Kim et al., 2024; Jiang, 2025; Ocampo et al., 2023).

Multi-task learning. Classifiers can be trained to detect hate speech jointly with secondary tasks. For example, as hate speech may relate to emotions (Fischer et al., 2018), a secondary task can be emotion classification (Min et al., 2023). Plaza-Del-Arco et al. (Plaza-Del-Arco et al., 2021) achieve promising results on binary hate speech detection by incorporating sentiment and emotion into their features. Awal et al. (Awal et al., 2021) employ a multi-task learning approach to jointly learn hate speech detection with secondary tasks, such as emotion classification and hateful target identification. The authors use a BERT transformer (Devlin et al., 2019) to share knowledge between tasks and Bidirectional Long Short-Term Memory networks to learn task-specific representations, followed by a gated fusion mechanism. The authors base their approach on the intuition that datasets from relevant tasks can augment the hate speech data for the primary detection task. The method proposed in (Mnassri et al., 2023) leverages emotion recognition as an auxiliary task for both hate speech and offensive language detection, via a shared BERT-based encoder and task-specific classification heads. Similarly, Jafari et al. (Jafari et al., 2023) incorporate sentiment features alongside fine-grained emotion and textual features to improve the detection of IHS compared to single-task methods.

External knowledge. Recent research focuses on enhancing hate speech detection by integrating various forms of real-world external knowledge, such as entity linking (Lin, 2022) and knowledge bases (Sridhar and Yang, 2022). Lin (Lin, 2022) links words appearing in tweets to their Wikipedia description and concatenates them with the original tweet before encoding. Sridhar and Yang (Sridhar and Yang, 2022) combine explicit knowledge from knowledge bases with expert knowledge from high-quality annotation and LLM-generated knowledge to improve explanations of stereotypes in toxic speech. Kim et al. (Kim et al., 2022) and Kim et al. (Kim et al., 2023) propose methods that utilize external knowledge, such as implications of anchor sentences and synonym substitution or machine-generated statements, respectively, to improve IHS detection using contrastive learning. In (Yang et al., 2023), the authors incorporate explanations generated using chain-of-thought to better discern between hate and not hate and to improve generalization to unseen datasets. Pérez et al. (Pérez et al., 2023) also demonstrate that the detection of hateful messages directed at certain communities, such as the LGBTI community, may benefit from the addition of context. The authors show that incorporating contextual parent comments and the corresponding news articles can improve the detection of hate speech in responses to posts from media outlets.

Contrastive learning. Ahn et al. (Ahn et al., 2024) designed a clustering-based contrastive learning technique that uses shared semantics extracted from the data to learn discriminative representations. Specifically, the model is trained to pull together posts from the same cluster and push apart those from different clusters. This approach eliminates the need for costly human-annotated implications or machine-augmented data. Kim et al. (Kim et al., 2024) propose a contrastive learning-based approach that leverages hard negative samples to mitigate overfitting and improve generalization without relying on external knowledge. Building on this idea, Jiang (Jiang, 2025) uses prediction errors to select hard positive samples for contrastive learning, which encourages the model to learn representations that are more robust to the spurious attributes that cause misclassification.

Ocampo et al. (Ocampo et al., 2023) use contrastive learning to bridge the representation gap between explicit and implicit hate speech. The authors build upon the observation that explicit and implicit text representations, when grouped by their target groups, tend to cluster together. The method pulls closer together pairs of implicit and explicit messages sharing the same target group, while pushing apart negative pairs (hate and not hate instances). This leads to more meaningful embedding representations and a better separation between not hate and hate instances. Masud et al. (Masud et al., 2024) propose to improve IHS detection by aligning the surface form of implicit hate with its implied meaning and increasing inter-cluster separation in the latent space to better distinguish speech categories.

3. Models

3.1. Enhancing BERT-based classifiers

BERT (Devlin et al., 2019) and its variants such as RoBERTa (Liu et al., 2019) have been extensively used for text classification (Aragon et al., 2023). Hate speech detection works (Kim et al., 2023; Jiang, 2025; Ahn et al., 2024; Kim et al., 2024, 2022) predominantly use models such as BERT, RoBERTa, and T5 (Raffel et al., 2020). Table 1 shows a summary of the backbone architectures used by the most recent related works on IHS.

We enhance the BERT model by incorporating tweet-level emotion information and tweet-driven contextual information via dedicated modules. Our BERT-based classifier consists of three main components, namely text analysis, emotion analysis, and context generation (see Figure 1).

Feature extraction. The text analysis module uses a fine-tuned BERT to extract the content of the tweet and represent it as an embedding vector. The emotion analysis module uses a fine-tuned BERTweet (Pérez et al., 2023) to infer a vector of probabilities across the following classes: fear, disgust, surprise, anger, sadness, joy, or other. Using a vector of probabilities instead of a single class allows the model to capture the complexity of the emotion. Understanding IHS relies heavily on contextual nuances. Capturing relevant context is made challenging by the short text length (tweets). Our context module leverages an uncensored Llama2 (https://huggingface.co/georgesung/llama2_7b_chat_uncensored) to generate the associated context, avoiding safeguards that might prevent processing and generation of certain content. We prompt the LLM to produce a neutral and factual context, which may include historical background or descriptions of stereotypes concerning the target of the text:

Prompt: *As an educational assistant, your task is to provide neutral and objective analysis of the provided tweet, without any personal biases. Offer short and concise information, context, and concepts to understand the content of the tweet without bias. The tweet may originate from different extremist groups, including White Nationalist, Neo-Nazi, Anti-Immigrant, Anti-Muslim, Anti-LGBTQ, KKK as well as non-extremist sources. The tweet could contain sarcasm, stereotypes, satire, metaphor, irony, or misinformation. Remember to avoid injecting personal opinions or interpretations into your analysis. Your aim is to provide a neutral understanding of the tweet’s content within a maximum of 150 words. *

The final prompt is [Prompt, “Tweet to analyze: ”, <Original tweet>]. We explicitly ask for an objective and neutral analysis to try to avoid bias from the data Llama2 was trained on. We also give some context about the dataset that is used so that the LLM has a starting point (see Appendix A for examples of generated context). The generated context is then used by RoBERTa to extract features.
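The prompt assembly described above can be sketched as simple string concatenation. This is a minimal illustration only: the instruction text is abbreviated, and the function name, the newline separator and the whitespace stripping are assumptions rather than details confirmed by the paper.

```python
# Abbreviated instruction: the full text is the prompt quoted in the paper.
INSTRUCTION_PROMPT = (
    "As an educational assistant, your task is to provide neutral and "
    "objective analysis of the provided tweet, without any personal biases."
)


def build_context_prompt(tweet: str, instruction: str = INSTRUCTION_PROMPT) -> str:
    """Concatenate the instruction, the fixed marker and the original tweet."""
    # The separator between the parts is an assumption of this sketch.
    return f"{instruction}\nTweet to analyze: {tweet.strip()}"


example_prompt = build_context_prompt("  an example tweet  ")
```

The resulting string would then be passed to the context-generating LLM, and its answer to the RoBERTa feature extractor.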

Feature fusion. We explore four feature fusion approaches, namely concatenation, adaptive fusion, mixture of experts, and shared learnable query. With concatenation, we concatenate the outputs of the three modules and classify the result with a two-layer perceptron (MLP). The first layer of the MLP has the same size as the concatenated embeddings (1543), whereas the second layer contains 2 nodes for the binary classes.
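The size of the concatenated vector can be checked with a small sketch. It assumes 768-dimensional tweet and context embeddings (the hidden size of the base BERT and RoBERTa models, which the text does not state explicitly) plus the 7 emotion probabilities, giving 768 + 768 + 7 = 1543. The helper names, the ReLU activation and the random initialization are illustrative assumptions.

```python
import random

TWEET_DIM = 768    # assumption: BERT-base hidden size
CONTEXT_DIM = 768  # assumption: RoBERTa-base hidden size
EMOTION_DIM = 7    # fear, disgust, surprise, anger, sadness, joy, other


def concat_features(tweet_vec, context_vec, emotion_probs):
    """Concatenate the three module outputs into one feature vector."""
    return list(tweet_vec) + list(context_vec) + list(emotion_probs)


def linear(x, weights, bias):
    """Dense layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def mlp_forward(x, w1, b1, w2, b2):
    """Two-layer perceptron; the hidden activation (ReLU) is an assumption."""
    hidden = [max(0.0, h) for h in linear(x, w1, b1)]
    return linear(hidden, w2, b2)  # two logits: not hate / hate


def init_layer(n_out, n_in, rng):
    """Small random weights and zero biases, for illustration only."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


fused = concat_features(
    [0.0] * TWEET_DIM, [0.0] * CONTEXT_DIM, [1.0 / EMOTION_DIM] * EMOTION_DIM
)
```

In the configuration described in the text, the first layer would map 1543 inputs to 1543 hidden units and the second layer would map those to 2 outputs.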

With adaptive fusion, we learn the parameters $\alpha_{\textit{tweet}}$, $\alpha_{\textit{context}}$, and $\alpha_{\textit{emotion}}$ that determine the scaling of each feature component. In order to maintain reasonable magnitude in the inputs, we constrain to $[-1,1]$ the learnable parameters with a sigmoid. With a simple mixture of experts, given a short text input, we utilize a simple MLP followed by a softmax layer to generate three adjustable feature scaling factors: $\alpha_{\textit{tweet}}$, $\alpha_{\textit{context}}$, $\alpha_{\textit{emotion}}$. The key distinction from adaptive fusion lies in the ability to tailor these scaling parameters specifically for each input, whereas adaptive fusion employs a fixed set of scaling parameters across all samples in the test dataset. Finally, for the shared learnable query, we use a multi-head attention with a shared learnable query, where keys and values are derived from both content and context embeddings. The query is a learnable parameter that is the same for both the content and context. The outputs of the multi-head attention blocks are then concatenated along with the emotion vector and fed to the classifier.
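The difference between adaptive fusion and the simple mixture of experts can be illustrated with a short sketch. It is written under several assumptions: the scaled features are concatenated afterwards, the gate logits are supplied directly instead of being produced by the gating MLP, and a plain sigmoid is used, which yields values in (0, 1), a sub-interval of the stated range. All function names are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def scale_and_concat(features, alphas):
    """Scale each feature vector by its factor, then concatenate (assumption)."""
    fused = []
    for vec, alpha in zip(features, alphas):
        fused.extend(alpha * v for v in vec)
    return fused


def adaptive_fusion(features, raw_params):
    """One learned scaling factor per source, shared by every input."""
    return scale_and_concat(features, [sigmoid(p) for p in raw_params])


def mixture_of_experts(features, gate_logits):
    """Per-input scaling factors; gate_logits stand in for the gating MLP output."""
    return scale_and_concat(features, softmax(gate_logits))


# Toy feature vectors for tweet, context and emotion (made-up values).
toy_features = [[1.0], [2.0], [3.0]]
fixed = adaptive_fusion(toy_features, [0.0, 0.0, 0.0])
gated = mixture_of_experts(toy_features, [0.3, -1.2, 2.0])
```

In `adaptive_fusion` the same three factors apply to every tweet once training ends, whereas in `mixture_of_experts` a new set of factors, summing to one, is computed for each input.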

3.2. Specializing generalist embeddings

General text embedding models, such as Stella (Zhang et al., 2025), E5 (Wang et al., 2024b), NV-Embed (Lee et al., 2025), and Jasper (Zhang et al., 2025), are the result of numerous improvements over BERT (Devlin et al., 2019) and RoBERTa (Liu et al., 2019). Several factors contribute to the better performance of newer embedding models compared to BERT. First, the embedding models are trained on a larger volume of data than BERT, enabling them to capture more diverse linguistic patterns and contextual nuances. Second, techniques such as hard-negative mining, in-batch negatives and contrastive learning in general appear to provide better embeddings for classification, even without a task-specific classification pipeline. E5 (Wang et al., 2024b) is initialized from XLM-RoBERTa-large (Conneau et al., 2020) and results from curated datasets and contrastive learning with mined hard negatives. NV-Embed (Lee et al., 2025), a fine-tuned version of Mistral 7B (Jiang et al., 2023), is trained with contrastive learning using in-batch hard negatives and uses a latent attention layer to produce embeddings. Stella (Zhang et al., 2025) is based on mGTE (Zhang et al., 2024b) and the general text embedding variant of Qwen2 (Yang et al., 2024b), where a final training stage involves matryoshka representation learning (MRL) (Kusupati et al., 2024), which makes it performant at different embedding sizes. Jasper (Zhang et al., 2025) uses a distillation of multiple teachers (Zhang et al., 2025; Lee et al., 2025) and is augmented with multi-modal capabilities through a final training stage where image-caption pairs are used with SigLIP (Tschannen et al., 2025) as the image encoder. These models also come in different sizes, with E5-large at 560 million parameters, Stella at 1.5 billion, Jasper at 2 billion, and NV-Embed at 7 billion.

To remove instruction bias, all models are fine-tuned using the following instruction template: Instruct: classify the following in no hate or hate.\nQuery:. The instruction is prepended to the short text that is being classified and then passed to the general text embedding model. Each model produces embeddings in $\mathbb{R}^{k\times n}$ whose dimensions depend on their specific implementation and the input length $k$. Following the recommendations provided by the model authors (https://huggingface.co/NovaSearch/stella_en_1.5B_v5, https://huggingface.co/intfloat/e5-large), we combine these embeddings into a single representation using a normalized sum over the token dimension. NV-Embed uses mean pooling as part of its final layer; we therefore use its output directly. This results in a final embedding vector in $\mathbb{R}^{n}$, which is subsequently fed into the classification module, which consists of a two-layer MLP with a hidden layer of size $n$ and LeakyReLU activations. The MLP ultimately reduces the dimensionality to 2 for classification (see Figure 2).
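The input construction and pooling steps above can be sketched as follows. The sketch reads "normalized sum over the token dimension" as a sum over the $k$ token vectors followed by L2 normalization, which is an assumption (a mean over tokens would be another reading). The function names and the space inserted after `Query:` are also illustrative assumptions.

```python
import math

# Instruction template quoted in the text.
INSTRUCTION = "Instruct: classify the following in no hate or hate.\nQuery:"


def build_input(text):
    """Prepend the instruction to the short text (the space is an assumption)."""
    return f"{INSTRUCTION} {text}"


def normalized_sum_pool(token_embeddings):
    """Map a k x n matrix of token embeddings to one L2-normalized n-vector."""
    n = len(token_embeddings[0])
    summed = [sum(tok[j] for tok in token_embeddings) for j in range(n)]
    norm = math.sqrt(sum(v * v for v in summed))
    return [v / norm for v in summed] if norm > 0 else summed


def leaky_relu(x, negative_slope=0.01):
    """Activation used in the classification MLP; the slope is an assumption."""
    return x if x > 0 else negative_slope * x


# Toy token embeddings with k = 2 tokens and n = 2 dimensions (made-up values).
pooled = normalized_sum_pool([[1.0, 0.0], [2.0, 4.0]])
```

The pooled vector in $\mathbb{R}^{n}$ is what the two-layer classification MLP would receive.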

To put the results of our embedding-based classifiers in perspective, we compare them with linear probing (i.e., only the classification module is optimized) and with recent generative models, such as Llama3-8B (Grattafiori et al., 2024), Gemma-7B (Team et al., 2024), and Qwen3-8B (Yang et al., 2025). For these LLMs, we take the average over the last hidden state as our embeddings (Wang et al., 2024a), which are then fed to the same classification module as for the generalist embedding models. We fine-tune the whole pipeline.

4. Validation

4.1. Datasets

To quantify the performance of the classifiers, we employ four commonly used IHS datasets that cover a variety of contexts and nuances of real-world scenarios. The distribution of labels in each dataset is reported in Table 2.

Implicit Hate Corpus (IHC) (ElSherief et al., 2021). This dataset consists of tweets collected between 2015 and 2017 from accounts of US extremist groups, including Black Separatist, White Nationalist, Neo Nazi, Anti-Muslim, Racist Skinhead, Ku Klux Klan, Anti-LGBT and Anti-Immigrant. Most of their speech targets minorities or specific groups of people. The samples are labeled as explicit hate, implicit hate, or not hate. It is important to note the class imbalance in this dataset: 13,206 tweets are not hate and 5,460 contain implicit hate. Following Kim et al. (2022, 2024), we use only the implicit hate samples as the hate class and discard the explicit hate samples. An example of a not hate sample is: "i have no idea what you are talking about. white supremacy = pure evil". An example of an implicit hate sample is: "#hannahcornelius - why not come home to #europe whites will never be welcome in #southafrica".

DynaHate (Vidgen et al., 2021). This dataset is built with an iterative process between a model and human annotators who progressively generate more challenging examples to trick the model (i.e., by flipping labels with minimal changes to the original post). The examples that are successful in tricking the model are then added to the training set. The model used for classification is RoBERTa with a sequence classification head, which is used to evaluate the difficulty of samples. The labeling includes hate/not hate, type of hate (e.g., threat, dehumanization), and target of hate. There are 41,255 entries, with $54\%$ of them labeled as hate.

SBIC (Sap et al., 2020). This dataset contains social media posts from Reddit and Twitter with implicit social biases, stereotypes, and power dynamics in language. It was annotated by Amazon Mechanical Turk workers. The main labels are offensive, not offensive, and maybe offensive. Secondary labels and annotations include intent to offend, sexual content, group or individual targeting, targeted group, implied statement, and in-group language (the target belongs to the same group as the writer). Following Kim et al. (2022), we classify the text as hate if the aggregated score for offensiveness is equal to or above $0.5$.

ToxiGen (Hartvigsen et al., 2022). This is a machine-generated dataset with toxic and benign statements about 13 minorities (e.g., African Americans, women, LGBTQ+). A subset of the generated data is validated by human annotators in terms of difficulty and toxicity. We use this subset, which is composed of 8,960 training samples (3,368 hate), 1,792 validation samples (638 hate), and 940 test samples (406 hate). We use the split provided by the authors. Following the official implementation (https://github.com/microsoft/TOXIGEN), we label a sample as hate if the sum of the toxicity scores given by the human annotators and the model exceeds $5.5$.
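The binarization rules stated for IHC, SBIC and ToxiGen can be summarized in a short sketch. The label strings and function names are hypothetical and do not reflect the column names of the released datasets; only the thresholds and sample counts are taken from the descriptions above.

```python
from typing import Optional


def ihc_label(raw_label: str) -> Optional[int]:
    """Map an IHC class to 0 (not hate) or 1 (hate); None means discarded.

    The label strings are hypothetical placeholders.
    """
    mapping = {"not_hate": 0, "implicit_hate": 1}
    return mapping.get(raw_label)  # explicit hate samples are dropped


def sbic_label(offensiveness: float) -> int:
    """Hate if the aggregated offensiveness score is at least 0.5."""
    return int(offensiveness >= 0.5)


def toxigen_label(human_score: float, model_score: float) -> int:
    """Hate if the summed human and model toxicity scores exceed 5.5."""
    return int(human_score + model_score > 5.5)


def hate_share(n_hate: int, n_total: int) -> float:
    """Fraction of samples in the hate class."""
    return n_hate / n_total


# Class balance implied by the counts reported in the text.
ihc_share = hate_share(5460, 13206 + 5460)
toxigen_train_share = hate_share(3368, 8960)
```

Note that the SBIC rule is inclusive at the threshold while the ToxiGen rule is strict, so a summed score of exactly 5.5 maps to the not hate class.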

Bibliography

Abu Farha et al. (2022) Ibrahim Abu Farha, Silviu V. Oprea, Steven Wilson, and Walid Magdy. 2022. SemEval-2022 Task 6: iSarcasmEval, Intended Sarcasm Detection in English and Arabic. In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022). Association for Computational Linguistics, Seattle, United States, 802–814. doi: 10.18653/v1/2022.semeval-1.111
Ahn et al. (2024) Hyeseon Ahn, Youngwook Kim, Jungin Kim, and Yo-Sub Han. 2024. SharedCon: Implicit Hate Speech Detection using Shared Semantics. In Findings of the Association for Computational Linguistics ACL 2024. Association for Computational Linguistics, Bangkok, Thailand and virtual meeting, 10444–10455. doi: 10.18653/v1/2024.findings-acl.622
Aragon et al. (2023) Mario Aragon, Adrian P. Lopez Monroy, Luis Gonzalez, David E. Losada, and Manuel Montes. 2023. DisorBERT: A Double Domain Adaptation Model for Detecting Signs of Mental Disorders in Social Media. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Toronto, Canada, 15305–15318. doi: 10.18653/v1/2023.acl-long.853
Awal et al. (2021) Md Rabiul Awal, Rui Cao, Roy Ka-Wei Lee, and Sandra Mitrović. 2021. AngryBERT: Joint Learning Target and Emotion for Hate Speech Detection. In Advances in Knowledge Discovery and Data Mining: 25th Pacific-Asia Conference, PAKDD 2021, Virtual Event, May 11–14, 2021, Proceedings, Part I. Springer-Verlag, Berlin, Heidelberg, 701–713. doi: 10.1007/978-3-030-75762-5_55
Basile et al. (2019) Valerio Basile, Cristina Bosco, Elisabetta Fersini, Debora Nozza, Viviana Patti, Francisco M. Rangel Pardo, Paolo Rosso, and Manuela Sanguinetti. 2019. SemEval-2019 Task 5: Multilingual Detection of Hate Speech Against Immigrants and Women in Twitter. In Proceedings of the 13th International Workshop on Semantic Evaluation. Association for Computational Linguistics, Minneapolis, Minnesota, USA, 54–63. doi: 10.18653/v1/S19-2007
Conneau et al. (2020) Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, and Veselin Stoyanov. 2020. Unsupervised Cross-lingual Representation Learning at Scale. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, Online, 8440–8451. doi: 10.18653/v1/2020.acl-main.747
Damo et al. (2024) Greta Damo, Nicolás B. Ocampo, Elena Cabrio, and Serena Villata. 2024. Unveiling the Hate: Generating Faithful and Plausible Explanations for Implicit and Subtle Hate Speech Detection. In Natural Language Processing and Information Systems. Springer Nature Switzerland, Cham, 211–225.