Who and What? Using Linguistic Features and Annotator Characteristics to Analyze Annotation Variation

Maximilian Maurer; Maximilian Linde; Gabriella Lapesa

arXiv:2605.06318 · cs.CL · May 8, 2026

TL;DR

This paper analyzes how annotator characteristics and linguistic features interact to influence annotation variation in harmful language detection datasets, revealing complex intersectional effects and dataset-specific patterns.

Contribution

The paper presents the first large-scale analysis that combines annotator traits, linguistic properties of the items, and the interactions between the two, and argues that all three are needed to understand annotation variation.

Findings

01. Interactions between annotator traits and linguistic features are crucial.

02. Lexical cues and annotator attitudes significantly influence annotations.

03. Effect patterns differ across datasets, cautioning against overgeneralization.

Abstract

Human label variation has been established as a central phenomenon in NLP: the different perspectives annotators have on the same item need to be embraced. Data collection practices have thus shifted towards increasing the number of annotators and releasing disaggregated datasets, with harmful language being the best-resourced domain due to its high subjectivity. While this has resulted in rich information about "who" annotated (sociodemographics, attitudes, etc.), the "what" (e.g., linguistic properties of items) and the interplay between the two have received little attention. We present the first large-scale analysis of four reference datasets for harmful language detection, bringing together annotator characteristics, linguistic properties of the items, and their interactions in a statistically informed picture. We find that interactions are crucial, revealing intersectional effects ignored in previous work,…
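
To make the notion of an interaction between an annotator characteristic and a linguistic feature concrete, here is a minimal sketch in Python. It is not the paper's method, which the abstract describes only as a statistically informed analysis. The records, the group names "A" and "B", and the binary item feature are all hypothetical and invented for illustration. The sketch computes a difference of differences: how much the effect of the item feature on the label differs between two annotator groups.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical toy records: (annotator_group, item_has_feature, label).
# label 1 means "harmful". Values are invented, not taken from the paper.
ANNOTATIONS = [
    ("A", False, 0), ("A", False, 0), ("A", True, 1), ("A", True, 1),
    ("B", False, 0), ("B", False, 1), ("B", True, 1), ("B", True, 0),
]


def cell_means(records):
    """Mean label for each (annotator group, item feature) cell."""
    cells = defaultdict(list)
    for group, feature, label in records:
        cells[(group, feature)].append(label)
    return {key: mean(values) for key, values in cells.items()}


def interaction_contrast(records, group_a="A", group_b="B"):
    """Feature effect in group_a minus feature effect in group_b."""
    m = cell_means(records)
    effect_a = m[(group_a, True)] - m[(group_a, False)]
    effect_b = m[(group_b, True)] - m[(group_b, False)]
    return effect_a - effect_b


if __name__ == "__main__":
    print(interaction_contrast(ANNOTATIONS))
```

A contrast near zero would mean the feature shifts labels similarly in both groups. A contrast far from zero means the feature matters differently depending on who annotates, which is the kind of pattern that looking at annotator traits or linguistic features in isolation would miss. A real analysis would also need a model that accounts for repeated annotators and items, and for uncertainty.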
