The Importance of Suppressing Domain Style in Authorship Analysis

Sebastian Bischoff; Niklas Deckers; Marcel Schliebs; Ben Thies; Matthias Hagen; Efstathios Stamatatos; Benno Stein; Martin Potthast

arXiv:2005.14714 · cs.CL · June 1, 2020 · 20 cites

TL;DR

This paper investigates how domain-specific styles influence authorship analysis and demonstrates that domain-adversarial learning significantly improves robustness against domain shifts, outperforming heuristic methods.

Contribution

It introduces a novel experimental setup for assessing domain influence in authorship analysis and proposes effective domain-adversarial learning techniques to mitigate domain effects.

Findings

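01. Character trigram features are highly sensitive to domain changes: classification accuracy drops by up to 55.4 percentage points when domains are swapped between training and testing.

02. Domain-adversarial learning reduces the accuracy loss under domain swapping to under 4%.

03. Heuristic domain-removal methods are less effective than learned approaches.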

Abstract

The prerequisite of many approaches to authorship analysis is a representation of writing style. But despite decades of research, it still remains unclear to what extent commonly used and widely accepted representations like character trigram frequencies actually represent an author's writing style, in contrast to more domain-specific style components or even topic. We address this shortcoming for the first time in a novel experimental setup of fixed authors but swapped domains between training and testing. With this setup, we reveal that approaches using character trigram features are highly susceptible to favor domain information when applied without attention to domains, suffering drops of up to 55.4 percentage points in classification accuracy under domain swapping. We further propose a new remedy based on domain-adversarial learning and compare it to ones from the literature based…
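To make the representation named in the abstract concrete, here is a minimal sketch of character trigram relative frequencies in Python. The function name `char_trigram_frequencies` and the absence of any preprocessing (lowercasing, whitespace normalization, vocabulary truncation) are assumptions made for illustration; this page does not describe the paper's actual feature pipeline.

```python
from collections import Counter
from typing import Dict


def char_trigram_frequencies(text: str) -> Dict[str, float]:
    """Map each character trigram in `text` to its relative frequency.

    Hypothetical helper for illustration only; not the authors' code.
    """
    # Slide a window of width 3 over the raw character sequence.
    trigrams = [text[i:i + 3] for i in range(len(text) - 2)]
    total = len(trigrams)
    if total == 0:
        # Texts shorter than three characters contain no trigrams.
        return {}
    counts = Counter(trigrams)
    # Normalize counts so the frequencies sum to 1.
    return {gram: n / total for gram, n in counts.items()}


freqs = char_trigram_frequencies("the theme")
```

Trigrams such as "the" are shaped by vocabulary and genre as much as by an individual author, which is one way to picture why such features may carry domain information alongside writing style.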

Code & Models

Datasets

swan07/authorship-verification
dataset · 268 downloads

Taxonomy

Topics: Authorship Attribution and Profiling · Names, Identity, and Discrimination Research · Hate Speech and Cyberbullying Detection