HebID: Detecting Social Identities in Hebrew-language Political Text
Guy Mor-Lan, Naama Rivlin-Angert, Yael R. Kaplan, Tamir Sheafer, Shaul R. Shenhav

TL;DR
HebID is a new multilabel Hebrew dataset for detecting nuanced social identities in political texts, enabling analysis of identity expression and differences between elite discourse and public priorities.
Contribution
The paper introduces the first multilabel Hebrew corpus for social identity detection, benchmarks encoder models and generative LLMs on it, and uses the resulting classifier to analyze political discourse alongside public survey data.
Findings
Hebrew-tuned LLMs give the best results among the benchmarked models (macro-F1 of 0.74)
Identity expression shows gender and temporal variations
Elite discourse and public identity priorities differ
Abstract
Political language is deeply intertwined with social identities. While social identities are often shaped by specific cultural contexts and expressed through particular uses of language, existing datasets for group and identity detection are predominantly English-centric, single-label and focus on coarse identity categories. We introduce HebID, the first multilabel Hebrew corpus for social identity detection: 5,536 sentences from Israeli politicians' Facebook posts (Dec 2018-Apr 2021), manually annotated for twelve nuanced social identities (e.g. Rightist, Ultra-Orthodox, Socially-oriented) grounded by survey data. We benchmark multilabel and single-label encoders alongside 2B-9B-parameter generative LLMs, finding that Hebrew-tuned LLMs provide the best results (macro-F1 = 0.74). We apply our classifier to politicians' Facebook posts and parliamentary speeches, evaluating differences…
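For readers unfamiliar with the reported metric, the following is a minimal sketch of how macro-F1 is commonly computed in a multilabel setting: F1 is calculated separately for each label and the per-label scores are averaged without weighting. It is not the authors' evaluation code. The function name `macro_f1`, the three-label subset, and the toy annotations are all hypothetical and chosen only for illustration; scoring a label with no positives in either gold or prediction as 0.0 is an assumed convention.

```python
from typing import List, Sequence


def macro_f1(y_true: Sequence[Sequence[int]],
             y_pred: Sequence[Sequence[int]],
             n_labels: int) -> float:
    """Unweighted mean of per-label F1 over 0/1 indicator rows.

    Each row is one sentence; each column is one identity label.
    A label with no true or predicted positives scores 0.0 here
    (an assumed convention, not taken from the paper).
    """
    per_label: List[float] = []
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        denom = 2 * tp + fp + fn
        per_label.append(2 * tp / denom if denom else 0.0)
    return sum(per_label) / n_labels


# Hypothetical toy data: three of the twelve identity labels, four sentences.
labels = ["Rightist", "Ultra-Orthodox", "Socially-oriented"]
gold = [[1, 0, 0], [0, 1, 1], [1, 0, 1], [0, 0, 0]]
pred = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]

score = macro_f1(gold, pred, len(labels))
print(f"macro-F1 = {score:.4f}")
```

Because every label contributes equally to the average, rare identity categories affect the score as much as frequent ones, which is why macro-averaging is a common choice for imbalanced multilabel data.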
Taxonomy
Topics: Computational and Text Analysis Methods · Sentiment Analysis and Opinion Mining · Authorship Attribution and Profiling
