Separating Style from Substance: Enhancing Cross-Genre Authorship Attribution through Data Selection and Presentation

Steven Fincke; Elizabeth Boschee

arXiv:2408.05192·cs.CL·August 12, 2024

TL;DR

This paper introduces data selection techniques and a curriculum learning approach to improve cross-genre authorship attribution by reducing topic influence, resulting in significant performance gains.

Contribution

It presents novel methods for training data selection and curriculum design that enhance model focus on stylistic features over topical cues.

Findings

01 62.7% relative improvement in cross-genre attribution

02 16.6% improvement within individual genres

03 Effective reduction of topic influence in authorship attribution

Abstract

The task of deciding whether two documents are written by the same author is challenging for both machines and humans. This task is even more challenging when the two documents are written about different topics (e.g., baseball vs. politics) or in different genres (e.g., a blog post vs. an academic article). For machines, the problem is complicated by the relative lack of real-world training examples that cross the topic boundary and the extreme scarcity of cross-genre data. We propose targeted methods for training data selection and a novel learning curriculum that are designed to discourage a model's reliance on topic information for authorship attribution and correspondingly force it to rely on information that is more robustly indicative of style, regardless of topic. These refinements yield a 62.7% relative improvement in average cross-genre authorship attribution, as well as 16.6% in…
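To make the idea of topic-aware training data selection concrete, the following is a minimal sketch and not the authors' method, which the abstract does not spell out. It assumes that word overlap (after dropping a few stopwords) is a crude proxy for topic, and it keeps only same-author document pairs whose overlap falls below a threshold, so that a model trained on them cannot lean on shared topic. The names `select_hard_positives`, `STOPWORDS`, and `max_similarity`, and the threshold value, are hypothetical and chosen for illustration only.

```python
import math
import re
from collections import Counter
from itertools import combinations

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "and", "of", "to", "in", "after", "over"}

def bag_of_words(text):
    """Lowercased content-word counts, used as a crude stand-in for topic."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)

def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def select_hard_positives(docs_by_author, max_similarity=0.2):
    """Keep same-author pairs whose topical overlap is at most max_similarity."""
    selected = []
    for author, docs in docs_by_author.items():
        vectors = [bag_of_words(d) for d in docs]
        for i, j in combinations(range(len(docs)), 2):
            sim = cosine(vectors[i], vectors[j])
            if sim <= max_similarity:
                selected.append((author, docs[i], docs[j], sim))
    return selected

# Toy data: two baseball documents and one politics document by one author.
docs_by_author = {
    "author_a": [
        "The pitcher threw a fastball and the batter swung late.",
        "The batter hit the fastball over the fence.",
        "The senate passed the budget bill after a long debate.",
    ],
}
pairs = select_hard_positives(docs_by_author)
```

In this toy example the two baseball documents share content words and are filtered out as a pair, while each baseball document paired with the politics document is kept. A real system would likely replace the word-overlap proxy with a stronger topic or embedding similarity measure.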
